Abstract: Recent work on the mathematical foundations of optimization
has begun to uncover its rich structure. In particular,
the “No Free Lunch” (NFL) theorems state that any two algo- 
rithms are equivalent when their performance is averaged across 
all possible problems. This highlights the need for exploiting 
problem-specific knowledge to achieve better than random per- 
formance. In this paper we present a general framework covering 
more search scenarios. In addition to the optimization scenarios 
addressed in the NFL results, this framework covers multi-armed 
bandit problems and evolution of multiple co-evolving players. As 
a particular instance of the latter, it covers “self-play” problems. 
In these problems the set of players work together to produce 
a champion, who then engages one or more antagonists in a 
subsequent multi-player game. In contrast to the traditional 
optimization case where the NFL results hold, we show that in 
self-play there are free lunches: in coevolution some algorithms 
have better performance than other algorithms, averaged across 
all possible problems. We consider the implications of these 
results to biology where there is no champion. 


I. Introduction 

Recently, the mathematical foundations of optimization have 
begun to be uncovered [?], [?], [?], [?], [?], [?], [?], [?], 
[?]. One particular result in this work, the “No Free Lunch” 
(NFL) theorems, establishes the equivalent performance of 
all optimization algorithms when averaged across all possible 
problems (see footnote 1). As an example of these theorems, recent work has
explicitly constructed objective functions where random search 
outperforms evolutionary algorithms [?]. There has also been 
much work extending these early results to different types 
of optimization (e.g. to multi-objective optimization [?]). The 
web site www.no-free-lunch.org offers a list of recent
references. 

However, all this previous work has been cast in a man- 
ner that does not cover repeated game scenarios where the 
“objective” or “fitness” function for one player or agent can 
vary based on the response of another player. In particular, 
the NFL theorems do not cover such scenarios. These game- 
like scenarios are usually called “coevolutionary” since they 
involve the behaviors of more than a single agent or player
[?].

One important example of coevolution is “self-play,” where 
from the system designer’s perspective, the players “cooper- 
ate” to train one of them as a champion. That champion is 
then pitted against an antagonist in a subsequent multi-player 
game. The goal is to train that champion player to perform 
as well as possible in that subsequent game. For a checkers 
example see [?]. 

1 More precisely, the algorithms must be compared after they have examined 
the same number of distinct configurations in the search space. 


Early work on coevolutionary scenarios includes [?], [?], 
[?]. More recently, coevolution has been used for problems 
that on the surface appear to have no connection to a game (for 
an early application to sorting networks see [?]). Coevolution 
in these cases enables escape from poor local optima in favor 
of better local optima. 

We will refer to all players other than the one of direct atten- 
tion as that player’s “opponents,” even when, as in self-play, 
the players can be viewed as cooperating. Sometimes when 
discussing self-play we will refer to the specific opponent to 
be faced by a champion in a subsequent game — an opponent 
not under our control — as the champion’s “antagonist.” 

In this paper we present a mathematical framework that 
covers both traditional optimization and coevolutionary sce- 
narios. It also covers other scenarios such as multi-armed 
bandits. We then use that framework to explore the differences 
between traditional optimization and coevolution. We find 
dramatic differences between the traditional optimization and 
coevolutionary scenarios. In particular, unlike the fundamental 
NFL result for traditional optimization, in the self-play domain
there are algorithms that are superior to other algorithms for 
all problems. However in the typical coevolutionary scenarios 
encountered in biology, where there is no champion, NFL still 
holds. 

Section II summarizes the previous NFL work that we
extend, and Section III motivates these extensions. Section IV
presents the resultant extended NFL framework, and provides 
example illustrations of the framework. Section V applies the 
NFL extensions to self play, and Section VI demonstrates 
that NFL results need not apply in this case. We conclude 
in Section VII. 


II. Background 


Motivated by the myriad heuristic approaches to combinato- 
rial optimization a number of researchers have sought insight 
into how best to match optimization algorithms to problems. 
[...] the approach taken in that paper as it forms the starting point
for our coevolutionary extensions. 

We consider search over a finite space X and assume 
that the associated space of possible “fitness” or “objective 
function” values $Y$ is also finite. The sizes of the spaces
are $|X|$ and $|Y|$ respectively. The space of possible fitness
functions, $F = Y^X$, contains $|Y|^{|X|}$ possible mappings from
$X$ to $Y$. A particular mapping in $F$ is indicated as $f \in F$.
All of the results mentioned in this section can be extended to the
case of stochastic fitness functions specified by conditional
distributions $P(y \in Y \mid x \in X)$ rather than single-valued
functions from $X$ to $Y$. (This is explicitly demonstrated below
when we introduce the generalized version of the original
NFL framework.) However, for pedagogical simplicity here
we restrict attention to single-valued $f$’s. We are interested
in the performance of algorithms when averaged across some
distribution $P(f)$ of such single-valued fitness functions.

The formalization of algorithms used in [?] is motivated
by the behavior of algorithms like genetic algorithms, simulated
annealing, and tabu search. All such algorithms sample
elements of the search space (i.e., select an $x \in X$), and
evaluate the fitness $y = f(x) \in Y$ of that sample. New $x$’s are
selected based upon previously sampled $x$’s and the associated
fitness values. At an iteration at which a total of $m$ distinct
$x$’s have been examined we write those $x$’s and associated
fitness values as an ordered set of $m$ distinct configurations:

$$d_m = \{(d_m^x(1), d_m^y(1)), \ldots, (d_m^x(m), d_m^y(m))\}.$$

Configurations in $d_m$ are ordered according to the time at which the
algorithm sampled them. Thus, $d_m^x(t)$ is the $t$’th sampled $x$ and
$d_m^y(t) = f(d_m^x(t))$ is the associated fitness. The ordered sets
of all $x$ and $y$ values are indicated as $d_m^x$ and $d_m^y$ respectively.
Algorithms are compared on the basis of the samples $d_m$ that
they generate.

It is important to note that the $x$’s in $d_m^x$ must all be distinct.
This means that algorithms are compared only on the basis
of the unique $x$’s they have examined. This does not mean
that algorithms that do revisit $x$’s (as genetic algorithms and
simulated annealing typically do) cannot be compared. Rather,
it means they must be compared based on the number of
distinct $x$’s they have examined. Further discussion of this
point is found in [?].

Based upon these definitions an algorithm is a (perhaps
non-deterministic) mapping from a set of samples $d_m$ to a new
(i.e., not yet visited) point in the search space, $d_{m+1}^x(m+1)$.
That mapping is specified by the probability distribution
$P_m(d_{m+1}^x(m+1) = x \mid d_m)$ defined over $X$ which gives the
probability of the algorithm selecting $x$ at time $m+1$. To ensure
that search space points are not revisited we require zero
probability on previously visited $x$’s. Thus,
$P_m(d_{m+1}^x(m+1) = x \mid d_m) = 0$ for all $x \in d_m^x$. The algorithm begins with
the selection of a starting configuration as specified by an
initial distribution $P_1(d_1^x(1) = x)$. An algorithm $a$ is then a
specification of the probability distributions $P_1$, $P_2$, etc. (This
definition of a search algorithm was also used in [?] in the
case where the mapping was assumed to be deterministic.)
With every visit to a new search space element the set of
samples is extended from $d_m$ to include the new $x$ and its
fitness, i.e., $d_{m+1} = d_m \cup \{x, f(x)\}$ so that $d_{m+1}^x(m+1) = x$
and $d_{m+1}^y(m+1) = f(x)$. While covering many classes
of algorithms (like simulated annealing, genetic algorithms,
tabu search, etc.), not all algorithms are of this type (e.g.,
enumerative algorithms like branch and bound). The results
presented here do not necessarily apply to algorithms outside
the class we consider.
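
As a minimal sketch of this definition (an illustration, not code from the paper), the loop below builds the ordered sample $d_m$ for a finite search space. The names `run_search`, `random_search` and `greedy_neighbor`, the toy fitness table `f`, and the integer encoding of the search space are all assumptions made for the example; the only properties taken from the text are that a selection rule sees the samples gathered so far and may only return a point that has not yet been visited.

```python
import random


def run_search(f, X, choose_next, m, rng):
    """Run a non-revisiting search for m steps and return the ordered sample d_m."""
    d = []                      # list of (x, y) pairs in sampling order
    visited = set()
    for _ in range(m):
        candidates = [x for x in X if x not in visited]
        # choose_next plays the role of the conditional distribution P_m(. | d_m)
        x = choose_next(d, candidates, rng)
        assert x in candidates  # zero probability on previously visited points
        d.append((x, f[x]))
        visited.add(x)
    return d


def random_search(d, candidates, rng):
    """Ignore the history and pick an unvisited point uniformly at random."""
    return rng.choice(candidates)


def greedy_neighbor(d, candidates, rng):
    """Hypothetical rule: pick the unvisited point closest to the best x so far."""
    if not d:
        return candidates[0]
    best_x = max(d, key=lambda pair: pair[1])[0]
    return min(candidates, key=lambda x: abs(x - best_x))


# Toy problem (assumed): six points with a hand-written fitness table.
X = list(range(6))
f = {0: 2, 1: 0, 2: 3, 3: 1, 4: 3, 5: 0}
rng = random.Random(0)
d_random = run_search(f, X, random_search, 4, rng)
d_greedy = run_search(f, X, greedy_neighbor, 4, rng)
```

Both rules are compared after the same number of distinct evaluations (here four), which is the basis of comparison the section describes.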

The efficacy of a search algorithm is assessed with a
performance measure, $\Phi(d_m^y)$, which is a function of all the

2 This set was called a trace in [?].


fitness values seen by the algorithm by step m. It is important 
to note that this measure of performance differs from the 
typical concerns of computational complexity. We are not
concerned with run times or memory issues. The performance
of an algorithm $a$ after having visited $m$ distinct $x$’s, averaged
over a class of optimization problems specified with a distribution
$P(f)$, is

$$E(\Phi \mid m, a) = \sum_{d_m^y} \sum_{f \in F} \Phi(d_m^y) \, P(d_m^y \mid f, m, a) \, P(f).$$

When $P(f)$ is uniform over any set of functions which
is closed under permutations (footnote 3) then it can be shown that
$P(d_m^y \mid m, a) = \sum_{f \in F} P(d_m^y \mid m, a, f) P(f)$ is independent of $a$
[?], [?], [?]. Thus, the expected performance of any pair of 
algorithms is equal under that average. The most general form 
for P(f) for which NFL results remain valid is derived in [?]. 
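
The equal-average claim can be checked by brute force on a tiny problem. The sketch below is an illustration under assumed choices (three search points, binary fitness values, and two hypothetical deterministic rules named `ascending` and `adaptive`); it enumerates every fitness function with uniform weight and tabulates how often each sequence of observed fitness values occurs.

```python
from collections import Counter
from itertools import product

X = [0, 1, 2]   # assumed toy search space
Y = [0, 1]      # assumed toy fitness values


def run(f, rule, m):
    """Return the fitness sequence d_m^y produced by a non-revisiting rule."""
    d = []
    for _ in range(m):
        seen = [x for x, _ in d]
        unvisited = [x for x in X if x not in seen]
        x = rule(d, unvisited)
        d.append((x, f[x]))
    return tuple(y for _, y in d)


def ascending(d, unvisited):
    """Visit points in increasing order, ignoring what was observed."""
    return unvisited[0]


def adaptive(d, unvisited):
    """Hypothetical rule: jump to the far end after observing a 1."""
    if d and d[-1][1] == 1:
        return unvisited[-1]
    return unvisited[0]


def histogram(rule, m):
    """Count each d_m^y over all |Y|^|X| fitness functions (uniform P(f))."""
    counts = Counter()
    for values in product(Y, repeat=len(X)):
        f = dict(zip(X, values))
        counts[run(f, rule, m)] += 1
    return counts


def mean_performance(rule, m, phi):
    """Average of phi(d_m^y) over all fitness functions."""
    hist = histogram(rule, m)
    total = sum(hist.values())
    return sum(phi(dy) * n for dy, n in hist.items()) / total
```

In this toy setting the two rules produce identical histograms, so any measure that depends only on the observed fitness values has the same average for both; for example, `mean_performance(rule, 2, max)` is 0.75 for either rule.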

[?] considers many extensions of this basic result and 
shows that algorithms may be distinguished once we look 
beyond simply average performance. Results independent of 
the distribution over problems P(f) may also be derived [?]. 

Our purpose here is to extend the framework discussed 
above to coevolutionary settings where there is more than 
a single player. As we shall see, such an extension can 
be developed which addresses many problems of interest in 
both evolutionary and coevolutionary optimization. Before 
presenting that extended framework formally, we motivate its 
extensions through consideration of an idealized coevolutionary
optimization problem, and the $k$-armed bandit.

III. Motivation

A. Self Play 

We can view the NFL framework reviewed above as a 
“game” in which a single player is trying to determine what 
“move” $x$ it should make to optimize $\Phi(d_m^y)$. As an example of
another type of problem we would like to study we consider
self-play. This extension involves moves of more than one
player, even though there is still a single $\Phi$ and $f$. For example,
in the case of two players the fitness function depends upon
the moves of both players, indicated as $\underline{x}$ and $\bar{x}$.

To illustrate this consider a multi-stage game involving 
the two players [?], [?], like checkers [?]. Have the players 
be computer programs. In this setting $\underline{x}$ and $\bar{x}$ are the two
complete computer programs that compete with each other, 
rather than the plays they make at any particular stage. These 
programs, fixed at the beginning of the game, specify each 
player’s entire strategy of what play to make in response to 
what set of preceding observations. It is these programs that 
are of interest. In noncooperative game theory these programs 
are called “normal form strategies”. In other applications, $\underline{x}$
might represent an algorithm to sort a list, and $\bar{x}$ a mutable
set of lists to be sorted. The payoff $f$ then reflects the ability
of the algorithm to sort the lists in $\bar{x}$.

In self-play we fix attention to the payoff to one of the 
two players, the “champion”, with the other player being the 
“opponent”. A fitness function $f(\underline{x}, \bar{x})$ gives the reward to the
champion (e.g., $+1$ for a pair of strategies in which it wins,
$0$ for an indeterminate or drawing pair, and $-1$ for a losing
pair). Now concatenate the strategies of our player ($\underline{x}$) and the

3 $P(f)$ is closed under permutations if, for any permutation $\sigma : X \to X$ of
inputs, $P(f) = P(f \circ \sigma)$.
opponent ($\bar{x}$) into a single joint point $x = [\underline{x}, \bar{x}]$. By doing this
we do not need to generalize fitness functions when we extend
the NFL framework; the fitness function remains a mapping
from $X = \underline{X} \times \bar{X}$ into $Y$. Now, however, $X$ is the space of
joint (champion, opponent) game strategies.

In the more general form of self-play this approach is 
extended by having several players compete in a tournament, 
and from the results of that tournament selecting a single 
best agent. That best agent constitutes the champion, who will 
compete against an antagonist in a subsequent game. The goal 
is to design the tournament to produce the champion with the 
best possibility of beating the antagonist. We would like to 
assess the efficacy of various such designs, and to see if NFL- 
like results also hold in this game setting. 

When designing a self-play tournament there are two differ- 
ent choices to make. First, one must decide how the “training 
games” are selected, i.e., how each set of all the players’ 
strategies for the next round of the tournament are chosen 
based on the results of the preceding rounds. Second, one 
must decide how to use the outcomes of all those games to 
select the champion. 

As in the original NFL work, the m distinct 
training games and their fitnesses are indicated as 
$d_m = \{(d_m^x(1), d_m^y(1)), \ldots, (d_m^x(m), d_m^y(m))\}$. Analogously
we write the probabilistic mapping that selects each new
training game’s strategies based on the results of the preceding
ones as a set of conditional distributions. We write that set as
the “algorithm” $a$, exactly as in the original NFL framework.

Choosing a champion is done with a function A which 
maps a completed sequence of training games, $d_m$, into a
champion strategy/move. We parameterize that champion as 
the associated subset of all joint strategies X consistent with 
it. For example, say that our champion strategy takes the role 
of the first player in a 2-player subsequent game with a single 
antagonist. In other words, our champion is a choice of a 
(hopefully) optimal first player’s strategy. So we choose that 
champion by selecting a particular value $\underline{x}^*$ for the strategy $\underline{x}$
of the first player in the subsequent game. Since that choice of
strategy doesn’t restrict the antagonist’s responses, we indicate
it as the subset of all $x \in X$ with $\underline{x} = \underline{x}^*$, i.e. the subset
$\{(\underline{x}^*, \bar{x}) \mid \bar{x} \in \bar{X}\}$. So $A$ maps $d_m$ to such a subset of $X$. (In
the more general approach of Sec. V, $A$ is allowed to map to
probability distributions over $X$, not just subsets of $X$.)

How do we judge the performance of the champion when 
we do not know how the antagonist will act? One possibility 
is to measure the performance of the champion against the 
antagonist who performs best against the champion. If the 
champion plays the game according to $\underline{x}^*$, then this worst
case measure may be written as $\min_{\bar{x}} f(\underline{x}^*, \bar{x})$ where $\bar{x}$ ranges
over all possible opponent strategies. Having defined $A(d_m)$
as above we can also write the worst case performance
as $\min_{x \in A(d_m)} f(x)$. A good champion will maximize this
worst possible performance. Here we see the first difference
from the original single-player NFL scenario. In that original
optimization setting performance is solely a function of $d_m^y$
(the observed game outcomes); here, however, the maximin
criterion has an explicit dependence on the fitness function $f$.
As we shall see, it is this dependence which will give rise to 


free lunches in which there can be a priori differences between 
algorithms. 
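
A small enumeration can make this concrete. The sketch below is an illustration under assumed choices, not an example taken from the paper: two champion moves, two opponent moves, payoffs in {0, 1}, two fixed training schedules, and a hypothetical champion-selection rule `pick_champion`. It averages the worst-case payoff of the selected champion over all 16 payoff functions, each weighted equally.

```python
from itertools import product

CHAMPION_MOVES = [0, 1]
OPPONENT_MOVES = [0, 1]
JOINT = [(c, o) for c in CHAMPION_MOVES for o in OPPONENT_MOVES]
PAYOFFS = [0, 1]


def worst_case(f, champion):
    """Maximin criterion: min over opponent replies of f(champion, reply)."""
    return min(f[(champion, o)] for o in OPPONENT_MOVES)


def pick_champion(d):
    """Hypothetical rule A: the champion move with the best observed worst case."""
    def score(c):
        seen = [y for (cc, _), y in d if cc == c]
        return min(seen) if seen else -1   # unobserved moves rank last
    return max(CHAMPION_MOVES, key=score)  # ties go to the lower index


def average_cost(training_games):
    """Average worst-case payoff of the chosen champion over all f."""
    total = 0
    count = 0
    for values in product(PAYOFFS, repeat=len(JOINT)):
        f = dict(zip(JOINT, values))
        d = [(x, f[x]) for x in training_games]   # observed training sample
        total += worst_case(f, pick_champion(d))
        count += 1
    return total / count


# Two assumed training schedules of the same length m = 2.
one_row = average_cost([(0, 0), (0, 1)])     # both games use champion move 0
both_rows = average_cost([(0, 0), (1, 0)])   # one game per champion move
```

Under these assumptions the two schedules give averages of 0.25 and 0.375 respectively, so they are not equivalent even though every payoff function is weighted equally. The difference arises because the cost depends on payoffs that were never observed during training.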

Other possible means of quantifying the performance of 
the champion are possible, and in some cases preferable. 
Subtleties in evaluating the performance of game-playing 
strategies are considered in [?], [?], [?]. In this work we 
concentrate on the maximin measure, but we expect that if the 
performance measure depends explicitly on $f$ then generically
NFL-type results will not hold.

B. Bandit Problems 

The $k$-armed bandit problem is simple, but captures much
of the essence of the critical exploration/exploitation tradeoff
inherent in optimization. In this problem an agent is faced with
repeatedly choosing between $k$ stochastic processes having
different means. With each selection (either process 1, process
2, ..., process $k$) the agent receives a reward stochastically
sampled from the process it chooses. The agent’s goal is to
maximize the total reward collected over $m$ selections. One
simple strategy is to sample each process $n$ times for a total of
$kn$ training points, and for the remaining $m - kn$ time steps to
sample that process which has the highest empirical mean based
on the $n$ points sampled from each process. An algorithm of
this type was proposed (erroneously) as justification for the 
schema theorem of genetic algorithms [?], [?]. 
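
The simple strategy described above can be sketched as follows. This is a minimal illustration, assuming each process is represented by a callable that returns one reward sample; the names `explore_then_commit` and `gaussian_arm` are hypothetical and are not taken from the paper.

```python
import random


def explore_then_commit(arms, n, m, rng):
    """Sample each of the k arms n times, then pull the empirically best arm
    for the remaining m - k*n steps. Returns (chosen arm index, total reward)."""
    k = len(arms)
    assert n > 0 and k * n <= m
    totals = [0.0] * k
    reward = 0.0
    # Exploration phase: k*n training points.
    for i, arm in enumerate(arms):
        for _ in range(n):
            r = arm(rng)
            totals[i] += r
            reward += r
    # Commit to the arm with the highest empirical mean.
    best = max(range(k), key=lambda i: totals[i] / n)
    # Exploitation phase: the remaining m - k*n selections.
    for _ in range(m - k * n):
        reward += arms[best](rng)
    return best, reward


def gaussian_arm(mean, sd=1.0):
    """Assumed reward model: a Gaussian process with the given mean."""
    return lambda rng: rng.gauss(mean, sd)


arms = [gaussian_arm(0.0), gaussian_arm(0.5), gaussian_arm(1.0)]
best, total = explore_then_commit(arms, n=20, m=200, rng=random.Random(1))
```

The choice of `n` sets the exploration/exploitation tradeoff: a small `n` risks committing to the wrong process, while a large `n` spends many selections on processes with low means.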

In order to allow NFL-like analyses to apply to algorithms 
for bandit problems we must generalize the notion of a fitness 
function. In this case the fitness of any given x value (x = i for 
selecting process i ) is not deterministic, but stochastic, given 
by sampling the associated process. To capture this we extend 
the definition of “fitness function” from a mapping $X \to Y$
to a mapping $X \to Z$, where $Z$ is a space capturing
probabilistic models. This is illustrated below. 

IV. General Framework 

As we have seen from these two examples, to increase the
scope of NFL-like analyses we need to make two slight extensions.
First, we must broaden the definition of performance
measures to allow for dependence on $f$, and second, we need
to generalize fitness functions to allow for non-determinism.
The resultant framework is closely related to the one used 
in the very first work on NFL, preceding its application to 
the problem of search, namely NFL for supervised machine 
learning [?], [?], [?]. 

A. Formal framework specification 

The framework involves two spaces, $X$ and $Z$. To guide intuition,
a typical scenario might have $x \in X$ be the joint strategy
followed by our players, and $z \in Z$ be one of the possible
probability distributions over some space of possible rewards
to the champion.

In addition to X and Z, we also have a fitness function 

$$f : X \to Z. \tag{1}$$

In the example where $z$ is a probability distribution over
rewards, $f$ can be viewed as the specification of an $x$-conditioned
probability distribution of rewards. In particular,
single-valued fitness functions are special cases of such an $f$,
where each $f(x)$ (each $x$-conditioned probability distribution)
is a delta function about some associated reward value.
Different such $f$ give different single-valued mappings from
$x$ to rewards. The introduction of $Z$ into the framework is
what allows for noisy payoffs, and what allows the framework
to cover bandit problems.

We have a total of $m$ time-steps, and represent the samples
generated through those time-steps as

$$d_m \equiv \left(\{d_m^x(t)\}_{t=1}^m, \{d_m^z(t)\}_{t=1}^m\right).$$

As in classic NFL each $d_m^x(t)$ is a particular $x \in X$.
Each $d_m^z(t)$ is a (perhaps stochastic) function of $f(d_m^x(t))$.
For example, say $z$’s (values of $f(x)$) are probability
distributions over reward values. Then $d_m^z(t)$ could consist of
the full distribution $f(d_m^x(t))$. Alternatively, it could consist
of a moment of that distribution, or even a random sample
of it. In general, we allow the function specifying $d_m^z(t)$ to
vary with $t$. However that freedom will not be exploited here.
Accordingly, we will leave that function implicit, to minimize
the notation. As shorthand we will write $d(t)$ to mean the pair
$(d_m^x(t), d_m^z(t))$.
A search algorithm, $a$, is an initial distribution $P_1(d_m^x(1))$
of the initially selected point $d_m^x(1) \in X$, together with a set
of $m - 1$ separate conditional distributions $P_t(d_m^x(t) \mid d_{t-1})$
for $t = 2, \ldots, m$. Such an algorithm specifies which $x$ to next
choose, based on the samples observed so far, for any time-step
$t$. As is usual, we assume that the next $x$ has not been
previously seen. This is reflected as an implicit restriction on
the conditional distributions $P_t(d_m^x(t) \mid d_{t-1})$.

Finally, we have a (potentially vector-valued) cost function,
$C(d_m, f)$, which is used to assess the performance of the
algorithm. Often our goal is to find the $a$ that will maximize
$E(C)$ for a particular choice of the mapping forming the
$d_m^z(t)$’s from the $f(d_m^x(t))$’s. This expectation $E(C)$ is formed
by averaging over any stochasticity in the mapping from
$f$’s to associated $d_m^z(t)$’s. It also averages over those fitness
functions $f$ consistent (in the sense of Bayes’ theorem) with
the observed samples $d_m$. (See below for examples.)

The NFL theorems concern averages over all $f$ of quantities
depending on $C$. For those theorems to hold (that is, for $f$-averages
of $C$ to be independent of the search algorithm) it is
crucial that for fixed $d_m$, $C$ does not depend on $f$. When that
independence is relaxed, the NFL theorems need not hold. As
we have seen, such relaxation occurs in self-play; it is how one
can have free lunches in self-play.

B. Examples of the framework 

Example 1: One example of this generalized framework 
is the scenario considered in the original NFL theorems. 
There we can identify each $z \in Z$ with a distribution over $Y$ where
$Y$ is a subset of $\mathbb{R}$ (for convenience we take $X$ and $Y$
countable). For single-valued fitness functions, as remarked
above, such distributions must be delta functions. In this case
the implicit mapping from $f(d_m^x(t))$ to the associated $d_m^z(t)$
is given simply by evaluating the real value $f$ has at $d_m^x(t)$.
As an alternative formulation, for such fitness functions we
can instead define $Z$ to be the same as $Y$. (Recall that
$Z$ need not be a space of probability distributions; that’s only
the choice of $Z$ used for illustrative purposes.) In the more
general version of the original NFL scenario $z$ is a non-delta-function
distribution over $Y$, and the mapping from $f(d_m^x(t))$ to the associated
$d_m^z(t)$ is given by forming a sample of $f(d_m^x(t))$.

In the scenario of the original NFL theorems $a$ does not
allow revisits. In addition we take $C(d_m, f) = \Phi(d_m^y)$ (recall
the definition of the performance measure $\Phi$ in Section II).
As already noted, for NFL to hold it is critical that the cost
function does not depend on $f$. It is also crucial that the search
algorithm $a$ not allow revisits. Both apply to the formulation
given here. Accordingly, the NFL theorems generically apply 
to scenarios which can be cast as an instance of this example. 

Example 2: While the variables are interpreted differently 
(e.g., x is now a joint strategy, not a single sample point), 
the formal specification of self-play in terms of our extended 
framework is almost identical to that of the original (noisy 
fitness function) NFL scenario. The only formal difference 
between the scenarios arises in the choice of C. 

In self-play we use the set of repeated games, together with 
any other relevant information we have (e.g., how the game 
against the antagonist might differ from the games heretofore), 
to choose the champion strategy to be used in the subsequent 
game against the antagonist. As we have seen, this dependence 
is given by a function $A(d_m)$ mapping $d_m$ to a subset of $X$.
Since it measures performance against the antagonist, C must 
involve this specification of the champion. 

Formally, $C$ uses $A$ to determine the quality of the search
algorithm that generated $d_m$ as follows:

$$C(d_m, f) = \min_{x \in A(d_m)} E(f), \tag{2}$$

where $E(f)$ is the expected value of the distribution of rewards
our champion receives for a joint strategy with the antagonist
given by $x$:

$$E(f) = \sum_{y \in Y} y \, P_f(y \mid x) = \sum_{y \in Y} y \, [f(x)](y), \tag{3}$$

where $[f(x)](y)$ is the distribution $f(x)$ evaluated at $y$.
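
As a minimal sketch of Eqs. (2) and (3), assume each $f(x)$ is stored as a dictionary from reward values to probabilities. The payoff table `f`, the strategy labels, and the helper names `expected_reward`, `champion_subset` and `cost` are hypothetical choices made for illustration.

```python
def expected_reward(z):
    """Eq. (3): sum over y of y * z(y), for a reward distribution z = f(x)."""
    return sum(y * p for y, p in z.items())


def champion_subset(f, champion):
    """The subset of joint strategies consistent with a fixed champion strategy."""
    return [x for x in f if x[0] == champion]


def cost(f, subset):
    """Eq. (2): the worst expected reward over the joint strategies in A(d_m)."""
    return min(expected_reward(f[x]) for x in subset)


# Assumed toy game: keys are (champion strategy, antagonist strategy),
# values are reward distributions over {-1, 0, +1}.
f = {
    ("c0", "o0"): {1: 0.5, -1: 0.5},
    ("c0", "o1"): {1: 1.0},
    ("c1", "o0"): {1: 0.5, 0: 0.5},
    ("c1", "o1"): {1: 0.25, 0: 0.75},
}

cost_c0 = cost(f, champion_subset(f, "c0"))
cost_c1 = cost(f, champion_subset(f, "c1"))
```

In this toy table champion `c0` has a worst-case expected reward of 0.0 and `c1` has 0.25, so a maximin designer would prefer `c1` even though `c0` wins with certainty against one antagonist strategy.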

This cost function is the worst possible payoff to the 
champion. There are several things to note about it. First, it still 
applies if the number of players in any game is greater than 
2 (the number of players just determines the dimensionality 
of $x$, and the form of the function $A$). Also, $A$ arises nowhere
in our formulation of self-play except in this specification of $C$.
Finally, note that the $C$ of self-play depends on $f$.

Say we have a 2-player self-play scenario and the antagonist
has no care for any goal other than hurting our champion. Say
that the antagonist is also omnipotent (or at least very lucky),
and chooses the $\bar{x}$ which achieves its goal. Then the expected
reward to the champion is given by Eq. (2). Obvious variants
of this setup replace the worst-case nature of C with some 
alternative, have A be stochastic, etc. 

Whatever variant we choose, typically our goal in self-play 
is to choose $a$ and/or $A$ so as to maximize $E(C)$, with the
expectation now extending to average over all possible $f$. The fact
that $C$ depends on $f$ means that NFL need not apply though.
Examples of this are presented below, in Sections V, V-B, and
VI.

Example 3: Another example is the $k$-armed bandit problem
introduced for optimization by Holland [?], and analyzed 
thoroughly in [?]. The scenario for that problem is identical to 
that for the NFL results, except that there are no constraints 
that the search algorithm not revisit previously sampled points, 
$Y = \mathbb{R}$, and every $z$ is a Gaussian. The fact that revisits are
allowed (since typically $m > k$) means that NFL need not
apply. 

Example 4: In the general biological coevolution scenario 
[?], [?], [?] there is a set of “players” who change their 
strategies from one game to the next, just like in self-play. 
Unlike in self-play though, each player has an associated 
frequency in the population of all players, and that frequency 
also varies through the succession of games. This means that 
the two scenarios are quite different when formulated with 
our framework. Moreover, the formulation for the general 
coevolution scenario involves definitions of $Z$, $f$, etc., that
would appear counter-intuitive if we were to interpret them 
the same way we do in self-play. 

We formulate the general coevolution scenario with our 
framework by having a set of N agents (or players, or cultures, 
or lineages of genomes, or lineages of genes, etc), just like 
in self-play. Their strategy/move spaces are written $X_i$, as in
self-play. Now however $X$ is extended beyond the current joint
strategy to include the joint “population frequency” value of
those strategies. Formally, we write

$$X = (X_1, u_1) \times \cdots \times (X_N, u_N), \tag{4}$$

and interpret each $x_i \in X_i$ as a strategy of $i$ and each $u_i \in \mathbb{R}$
as a frequency with which $i$ occurs in the overall population
of all players (footnote 4).

To be more precise, we interpret $x_i(t)$ as $i$’s current strategy.
However we interpret $u_i(t)$ as $i$’s previous population
frequency, i.e., the population frequency, at the preceding
timestep, of the strategy that $i$ followed then. In other words,
we interpret the $u_i$ component of $d_m^x(t)$ as the population
frequency at timestep $t - 1$ of the strategy followed by agent $i$
then, a strategy given by the $x_i$ component of $d_m^x(t - 1)$. So
the information concerning each agent $i$ is “staggered” across
pairs of successive timesteps. This is done so that $a$ can give
the sequence of joint population frequencies that accompanies
the sequence of joint strategies, as described below.

When $i$ is a single agent, this choice of $X$ accommodates
learning in $i$ by allowing the strategy of $i$, $x_i \in X_i$, to change
from one timestep to the next. When $i$ is a “lineage of a
gene” it is not the strategy (i.e., the gene) that changes from
one timestep to the next, but the associated frequency of that
strategy in the population. This too is accommodated in our
choice of $X$; changes in $X$ from one time-step to the next can
involve changes in the joint-frequency without any changes
in the joint-strategy. More generally, our formulation allows
both kinds of changes to occur simultaneously. In addition,

4 Typically $\sum_i u_i = 1$ of course, though we have no need to explicitly
require this here. Indeed, the formalism allows the $u_i$ not to be population
frequencies, but rather integer-valued population counts.


mutation, e.g., modification of the gene, can be captured with 
this framework. This is done by having some $i$’s that at certain
times have 0 population frequency, but then stochastically 
jump to non-0 frequency, representing a new agent that is a 
mutant of an old one. 

Have each $z$ be a probability distribution over the possible
current population frequencies of the agents. So given our
definition of $X$, we interpret $f$ as a map taking the previous
joint population frequency, together with the current joint
strategy of the agents, into a probability distribution over the
possible current joint population frequencies of the agents.

As an example, in evolutionary game theory, the joint strategy
of the agents at any given $t$ determines the change in each
one’s population frequency in that time-step. Accordingly, in
the replicator dynamics of evolutionary game theory, $f$ takes a
joint strategy $x_1 \times \cdots \times x_N$ and the values of all agents’ previous
population frequencies, and based on that determines the new
value of each agent’s population frequency. More precisely,
$d_m^z(t)$ is a sample of that distribution.
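
For intuition, one common discrete-time form of the replicator update can be sketched as below. This specific update rule, the payoff matrix, and the name `replicator_step` are assumptions made for illustration; the paper does not specify a particular form. The sketch is the deterministic special case in which the distribution over new frequencies is a delta function.

```python
def replicator_step(payoff, u):
    """One discrete replicator update (assumed form).

    payoff[i][j] is the payoff to strategy i against strategy j (assumed
    positive), and u is the list of previous population frequencies.
    Returns the new population frequencies.
    """
    n = len(u)
    # Fitness of each strategy against the current population mix.
    fitness = [sum(payoff[i][j] * u[j] for j in range(n)) for i in range(n)]
    # Population-average fitness, used to renormalize the frequencies.
    mean_fitness = sum(ui * fi for ui, fi in zip(u, fitness))
    return [ui * fi / mean_fitness for ui, fi in zip(u, fitness)]


# Assumed toy example: strategy 0 always earns twice what strategy 1 earns.
payoff = [[2.0, 2.0], [1.0, 1.0]]
u_prev = [0.5, 0.5]
u_next = replicator_step(payoff, u_prev)
```

Here the joint strategy (encoded in `payoff`) is held fixed while only the frequencies change, which matches the simplest version of evolutionary game theory discussed in this example.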

In this general coevolution scenario, our choice for $a$,
which produces $d_m^x(t + 1)$ from $d_m^x(t)$, plays two roles.
These correspond to its updates of the strategy components of
$d_m^x(t)$ and of the population frequency components of $d_m^x(t)$,
respectively. More precisely, one of the things $a$ does is update
the population frequencies from those of the previous timestep
$t - 1$ (which are stored in $d_m^x(t)$) to the ones given by $d_m^z(t)$.
This means directly incorporating those population frequencies
into the $\{u_i\}$ components of $d_m^x(t + 1)$. The other thing $a$ does,
as before, is determine the joint strategy $[x_1, \ldots, x_N]$ for time
$t + 1$. At the risk of abusing notation, as in self-play we can
write the generation of the new strategy of each agent $i$ by
using a (potentially stochastic and/or time-varying) function
written $a_i$. In sum then, an application of $a$ to a common $d_t$ is
given by the simultaneous operation of all those $N$ distinct $a_i$
on $d_t$, as well as the transfer of the joint population frequency
from $d_m^z(t)$. The result of these two processes is $d_m^x(t + 1)$.

Note that the new joint strategy produced by a may depend 
on the previous time-step’s population frequencies, in gen- 
eral. As an example, this corresponds to sexual reproduction 
in which mating choices are stochastic, so that how likely 
agent i is to mate with agent j depends on the population 
frequencies of agents $i$ and $j$ (footnote 5). However in the simplest version
of evolutionary game theory, the joint strategy is actually
constant in time, with the only thing that varies in time
being the population frequencies, updated in $f$. If the agents
are identified with distinct genomes, then in this version of 
evolutionary game theory reproduction is parthenogenic. 

The choice of $C$ depends on what one wishes to know
about a sequence $d_m$ and $f$. Typical analyses performed in
population biology and associated fields have $C$ be a vector
with $N$ components, each component $j$ depending only on
the associated $d_t(j)$. As an example, typically in evolutionary
game theory each component is $j$’s population frequency at
5 Obvious elaborations of the framework allow X to include relative rewards 
between agents in the preceding round, as well as the associated population 
frequencies. This elaboration would allow mate selection to be based on 
current differential fitness between candidate mates, as well as their overall 
frequency in the population. 
$t - 1$.

In general in such biological analyses there is no notion of a 
champion being produced by the search and subsequently pit- 
ted against an antagonist in a “bake-off.” (Famously, evolution 
is not teleological.) Accordingly, unlike in self-play, there is
no particular significance to results for alternative choices of
C that depend on $f$ in such analyses. So long
as we make the approximation, reasonable in real biological
systems, that x's are never revisited, all of the requirements
of Example 1 are met, and therefore NFL applies.

Some authors have promoted the use of the general coevo- 
lution scenario as a means of designing an entity to perform 
well, rather than as a tool for analyzing how a system happens 
to develop. In general, whether or not NFL applies to such a 
use will depend on the details of the design problem. 

For example, say the problem is to design a value y that 
maximizes a provided function g(y), e.g., design a biological 
organ that can function as an optical sensor. Then, even if 
we are in the general coevolutionary scenario of interacting
populations, we can still cast the problem as an instance of
the choices of Z, $f$, etc., of Example 1. In particular, for
our design problem C does not involve any “subsequent game
against an antagonist”, and C is independent of $f$. So the NFL
theorems hold; the extra details of the dynamics introduced
by coevolution don't affect the validity of those theorems,
which is independent of such details. On the other hand,
say the problem is to design an organism that is likely to 
avoid extinction (i.e., have a non-zero population frequency) 
in the years after a major change to the ecosystem. For this 
problem the coevolution scenario is a variant of self-play; the 
“years after the major change to the ecosystem” constitute the 
“subsequent game against an antagonist”. In this situation NFL 
may not hold. 

There are other ways one can express the general coevolution
scenario in our framework, i.e., other choices for the
roles of $f$, $a$, etc. The advantage of the one used here is how it
formally separates the different aspects of the problem. $f$ plays
the role of the laws of Nature which map joint strategies and
population frequencies to new population frequencies (e.g.,
the replicator dynamics). All variability in how one might
update strategies (cross-over, mutation, etc.) is instead
encapsulated in $a$. In particular, if one wishes to compare two
such update schemes, without knowing anything about $f$ ahead
of time or being able to modify it, that means comparing two
different $a$'s, while $f$ is fixed and not something we can have
any knowledge about.

V. Application to Self-play 

In section III-A we introduced a model of self-play. In the 
remainder of this paper we show how free lunches may arise 
in this setting, and quantify the a priori differences between 
certain self-play algorithms. 

To summarize self-play, we recall that agents (game strategies)
are paired against each other in a (perhaps stochastically
formed) sequence to generate a set of 2-player games. After
m distinct training games between an agent and its opponents,
the agent enters a competition. Performance of the agent is
measured with a payoff function. The payoff function to the
agent when it plays strategy $\underline{x}$ and its opponent plays $\bar{x}$ is
written as $f(x)$ where $x = (\underline{x}, \bar{x})$ is the joint strategy. We
make no assumption about the structure of strategies except
that they are finite.

We define the payoff for the agent playing strategy $\underline{x}$
independent of an opponent's reply, $g(\underline{x})$, as the least payoff
over all possible opponent responses: $g(\underline{x}) = \min_{\bar{x}} f(\underline{x}, \bar{x})$.
With this criterion, the best strategy an agent can play is
that strategy which maximizes $g$ (a maximin criterion) so that
its performance in competition (over all possible opponents)
will be as good as possible. We are not interested in search
strategies just across the agent, but more generally across the
joint strategies of the agent and its opponents. (Note that
whether that opponent varies or not is irrelevant, since we
are setting its strategies.) The ultimate goal is to maximize
the agent's performance $g$.
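The maximin criterion above can be sketched in a few lines. This is only an illustration under stated assumptions: the payoff table `example_f` and the function names are hypothetical and do not appear in the paper, and payoffs are stored in a dictionary keyed by (agent strategy, opponent response).

```python
from typing import Dict, Tuple

# Hypothetical payoff table: f[(agent_strategy, opponent_response)] -> payoff.
Payoff = Dict[Tuple[int, int], float]


def maximin_payoffs(f: Payoff) -> Dict[int, float]:
    """Return g(x) = min over opponent responses of f(x, response)."""
    g: Dict[int, float] = {}
    for (agent, _response), value in f.items():
        # Keep the smallest payoff seen so far for this agent strategy.
        g[agent] = min(value, g.get(agent, value))
    return g


def best_maximin_strategy(f: Payoff) -> int:
    """Return the agent strategy maximizing g (ties go to the smallest label)."""
    g = maximin_payoffs(f)
    return min(g, key=lambda agent: (-g[agent], agent))


# Illustrative payoffs only (an assumption, not data from the paper).
example_f = {(1, 1): 0.5, (1, 2): 1.0, (2, 1): 1.0, (2, 2): 1.0}
print(maximin_payoffs(example_f))        # worst-case payoff per agent strategy
print(best_maximin_strategy(example_f))  # strategy with the best worst case
```

Note that computing $g$ exactly requires every opponent response to be known, which is the point developed in the following paragraphs.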

We make one important observation. In general, using a
random pairing strategy in the training phase will not result in
a training set that can be used to guarantee that any particular
strategy in the competition is better than the worst possible
strategy. The only way to ensure an outcome guaranteed to
be better than the worst possible is to exhaustively explore all
possible responses to strategy $\underline{x}$, and then determine that the
worst value of $f$ for all such joint strategies is better than the
worst value for some other strategy, $\underline{x}'$. To do this requires
that m is greater than the total number of possible strategies
available to the opponent, but even for very large m, unless all
possible opponent responses have been explored, we cannot
make any such guarantees.

Pursuing this observation further, consider the situation
where we know (perhaps through exhaustive enumeration of
opponent responses) that the worst possible payoff for some
strategy $\underline{x}$ is $g(\underline{x})$ and that another joint strategy $x' = (\underline{x}', \bar{x}')$
with $\underline{x} \neq \underline{x}'$ results in a payoff $f(x') < g(\underline{x})$. In this case
there is no need to explore other opponent responses to $\underline{x}'$
since it must be that $g(\underline{x}') < g(\underline{x})$, i.e., $\underline{x}'$ is maximin inferior
to $\underline{x}$. Thus, in designing an algorithm to search for good
strategies, any algorithm that avoids searching regions that are
known to be maximin inferior (as above) will be more efficient
than one that searches these regions (e.g., random search).
This applies for all $g$, and so the smarter algorithm will
have an average performance greater than the dumb algorithm.
Roughly speaking, this result avoids NFL implications because
varying uniformly over all $f$ does not vary uniformly over all
possible $g$, which are the functions that ultimately determine
performance.
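As a small illustration of the pruning argument, the sketch below flags strategies that are already known to be maximin inferior. The function name and the example numbers are hypothetical; the only assumption is that one strategy has been exhausted so that its $g$ value is known exactly.

```python
from typing import Dict, Set, Tuple


def maximin_inferior(observed: Dict[Tuple[int, int], float],
                     g_known: float) -> Set[int]:
    """Agent strategies with at least one observed payoff below g_known.

    A single observed payoff f(x', response) < g_known already implies
    g(x') < g_known, so no further responses to x' need to be explored.
    """
    return {agent for (agent, _response), value in observed.items()
            if value < g_known}


# Hypothetical situation: some strategy has been exhausted and has g = 0.6.
# One response has been sampled for each of strategies 2 and 3.
observed_payoffs = {(2, 1): 0.4, (3, 1): 0.9}
print(maximin_inferior(observed_payoffs, g_known=0.6))  # strategy 2 can be skipped
```

Strategy 3 is not ruled out by this test; its remaining responses could still push its worst case below 0.6.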

In the following sections we develop this observation fur- 
ther. 


A. Definitions 

We introduce a few definitions to explore our observation.
We assume that there are $\underline{l}$ strategies available to an agent, and
label these using $\underline{X} = [1, \cdots, \underline{l}]$. For each such strategy we
assume the opponent may choose from one of $\bar{l}(\underline{x})$ possible
strategies forming the space $\bar{X}(\underline{x})$. 6 Consequently, the size of
the joint strategy space is $|X| = \sum_{\underline{x}=1}^{\underline{l}} \bar{l}(\underline{x})$. For simplicity
we take $\bar{X}(\underline{x})$ to be independent of $\underline{x}$ so that $\bar{X} = [1, \cdots, \bar{l}]$
and $|X| = \underline{l}\,\bar{l}$. If the training period consists of m distinct joint
strategies, even with m as large as $|X| - \underline{l}$, we cannot guarantee
that the agent will not choose the worst possible strategy in
the competition as the worst possible strategy could be the
opponent response that was left unexplored for each of the $\underline{l}$
possible strategies.

As always, a sample of configurations (here configurations
are joint strategies) is a sample of distinct points from the input
space $X$, and their corresponding fitness values. For simplicity
we assume that fitness payoffs are a deterministic function
of joint strategies. Thus, rather than the more general output
space $Z$, we assume payoff values lie in a finite totally ordered
space $Y$. Consequently, the fitness function is the mapping
$f : X \to Y$ where $X = \underline{X} \times \bar{X}$ is the space of joint strategies.
As in the general framework, a sample of size m is represented
as

$$d_m = \{(d_m^x(1), d_m^y(1)), \cdots, (d_m^x(m), d_m^y(m))\}$$

where $d_m^x(t) = \{d_m^{\underline{x}}(t), d_m^{\bar{x}}(t)\}$ and $d_m^y(t) = f(d_m^{\underline{x}}(t), d_m^{\bar{x}}(t))$,
and $t \in [1, \cdots, m]$ labels the samples taken. In the above
definition $d_m^{\underline{x}}(t)$ is the t'th strategy adopted by the agent, $d_m^{\bar{x}}(t)$ is
the opponent response, and $d_m^y(t)$ is the corresponding payoff.
As usual, we assume that no joint configurations are revisited,
and that an algorithm $a$ defined exactly as in the classic
NFL case is used to generate sample sets $d_m$. A particular
coevolutionary optimization task is specified by defining the
payoff function that is to be maximized. As discussed in [?],
a class of problems is defined by specifying a probability
density $P(f)$ over the space of possible payoff functions. As
long as both $X$ and $Y$ are finite (as they are in any computer
implementation) this is conceptually straightforward.

There is an additional consideration in the coevolutionary
setting, namely the decision of which strategy to apply in the
competition based upon the results of the training samples.
In the framework we have outlined this choice is buried in
the performance measure through the function $A(d_m)$. Recall
that $A(d_m)$ is a function which, given a sample of games and
outcomes, returns a probability distribution over a subset of
$X$. In the case where $A(d_m)$ is deterministic and selects the
champion strategy $\underline{x}^*$ based on $d_m$ then the subset output by
$A$ is $\{(\underline{x}^*, \bar{x}) \mid \bar{x} \in \text{possible opponent responses to } \underline{x}^*\}$.

If $A$ is deterministic the natural empirical measure of the
performance of the search algorithm $a$ obtained during training
is

$$C = \min_{x \in A(d_m) \cap d_m^x} f(x).$$

Though we shall not pursue it here, it is a simple matter to
allow for non-deterministic $A$. In such cases $A(d_m)$ might
stochastically define an optimal strategy $\underline{x}^*$ through specification
of a $d_m$-dependent probability density $p(\underline{x}^* \mid d_m)$ over $\underline{X}$. In
this situation, performance could be defined as the weighted
average

$$\sum_{\underline{x}^* \in \underline{X}} p(\underline{x}^* \mid d_m) \min_{\bar{x} \in \bar{X}(\underline{x}^*)} f(\underline{x}^*, \bar{x}),$$

where the min over $\bar{x}$ is over possible opponent responses
to $\underline{x}^*$. It is also straightforward to include a distribution over
opponent responses if that were known.

6 Note that the space of opponent strategies varies with $\underline{x}$. This is the typical
situation in applications to games with complex rules (e.g., checkers).

To summarize, search algorithms are defined exactly as
in classical NFL, but performance measures are extended to
depend both on $f$ and $A$. The best $a$ for a particular $f$ and $A$
are those that maximize $C$.

The original version of NFL (for traditional optimization)
defines the performance differently because there is no opponent.
In the simplest case, the performance of $a$ (recall that
there is no champion-selecting procedure) might be measured
as $C = \max_{t \in [1, m]} d_m^y(t)$. One traditional NFL result states
that the average performance of any pair of algorithms is
identical, or formally, $\sum_f P(C \mid f, m, a)$ is independent of $a$. 7
A natural extension of this result considers a non-uniform
average over fitness functions. In this case the quantity of interest
is $\sum_f P(C \mid f, m, a) P(f)$ where $P(f)$ weights different fitness
functions.

A result akin to this one in the self-play setting would state
that the uniform average $\sum_f P(C \mid f, m, a, A)$ is independent
of $a$. However, as we have seen informally, such a result cannot
hold in general since a search process with an $a$ that exhausts
an opponent's repertoire of strategies has better guarantees
than other search processes. A formal proof of this statement
is presented in section VI.

B. An Exhaustive Example 

Before proving the existence of free lunches we provide a
small exhaustive example to illustrate our definitions, and to
show explicitly why we expect free lunches to exist. Consider
the case where the player has two possible strategies, i.e.,
$\underline{X} = \{1, 2\}$, the opponent has two responses for each of
these strategies, i.e., $\bar{X} = \{1, 2\}$, and there are two possible
fitness values, $Y = \{1/2, 1\}$. The 16 possible functions
are listed in Table I. We see that the maximin criterion we
employ gives a biased distribution over possible performance
measures: 9/16 of the functions have $g = [1/2 \;\; 1/2]$, 3/16
have $g = [1/2 \;\; 1]$, 3/16 have $g = [1 \;\; 1/2]$, and 1/16 have
$g = [1 \;\; 1]$ where $g = [g(\underline{x} = 1) \;\; g(\underline{x} = 2)]$.

If we consider a particular population, say $d_2 =
\{(1, 2; 1/2), (2, 2; 1)\}$, the payoff functions that are consistent
with this population are $f_9$, $f_{10}$, $f_{13}$, $f_{14}$ and the corresponding
distribution over $g$ functions is $\delta(g - [1/2 \;\; 1/2])/2 + \delta(g -
[1/2 \;\; 1])/2$. Given that any population will give a biased
sample over $g$ functions, it may not be surprising that there are
free lunches. We expect that an algorithm which is able to
exploit this biased sample would perform uniformly better than
another algorithm which does not exploit the biased sample of
$g$'s. In the next section we prove the existence of free lunches
by constructing such a pair of algorithms.
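The counts quoted in this example can be checked by brute force. The sketch below enumerates all 16 payoff functions with exact rational arithmetic; the helper names are hypothetical, and the ordering of the functions need not match the labels $f_1, \ldots, f_{16}$ of Table I.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

HALF, ONE = Fraction(1, 2), Fraction(1)
JOINT = [(1, 1), (1, 2), (2, 1), (2, 2)]  # (agent strategy, opponent response)


def all_payoff_functions():
    """Yield each of the 2**4 = 16 payoff functions as a dict."""
    for values in product([HALF, ONE], repeat=len(JOINT)):
        yield dict(zip(JOINT, values))


def g_vector(f):
    """Return (g(1), g(2)): the worst-case payoff of each agent strategy."""
    return tuple(min(f[(x, 1)], f[(x, 2)]) for x in (1, 2))


def g_distribution(functions):
    """Fraction of the given payoff functions that map to each g vector."""
    functions = list(functions)
    counts = Counter(g_vector(f) for f in functions)
    return {g: Fraction(n, len(functions)) for g, n in counts.items()}


# Uniform average over all f induces a non-uniform distribution over g.
prior = g_distribution(all_payoff_functions())

# Condition on the population d_2 = {(1, 2; 1/2), (2, 2; 1)}.
d2 = {(1, 2): HALF, (2, 2): ONE}
consistent = [f for f in all_payoff_functions()
              if all(f[joint] == y for joint, y in d2.items())]
posterior = g_distribution(consistent)

print(prior)
print(len(consistent), posterior)
```

Four functions survive the conditioning, and they split evenly between the two remaining $g$ vectors, which is the biased sample the text refers to.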

7 Actually far more can be said, and the reader is encouraged to consult [?]
for details.

TABLE I
Exhaustive enumeration of all possible functions $f(\underline{x}, \bar{x})$ and $g(\underline{x}) = \min_{\bar{x}} f(\underline{x}, \bar{x})$ for $\underline{X} = \{1, 2\}$, $\bar{X} = \{1, 2\}$, and $Y = \{1/2, 1\}$. The payoff functions labeled in bold are those consistent with the population $d_2 = \{(1, 2; 1/2), (2, 2; 1)\}$.


VI. Construction of Free Lunches

In this section a proof is presented that there are free
lunches for self-play by constructing a pair of search algorithms
such that one explicitly has performance equal to or
better than the other for all possible payoff functions $f$. We
normalize the possible $Y$ values so that they are equal to
$1/|Y|, 2/|Y|, \cdots, 1$. Thus, regardless of how $Y$ values are
assigned by the fitness function, our measure gives the fraction
of possible fitness values having lesser or equal fitness, and
thus forms a sort of normalized ranking.

As discussed earlier, we assume that all $\underline{l}$ agent strategies
offer the same number of possible opponent responses, $\bar{l}$. We
consider algorithms that explore $m = \bar{l}$ distinct joint samples.
Agent strategies are labeled by $\underline{x} \in \{1, \cdots, \underline{l}\}$ and opponent
responses are labeled by $\bar{x} \in \{1, \cdots, \bar{l}\}$. For simplicity we take
$\underline{l} = \bar{l}$.

In the following section we consider three different algo- 
rithms and show different expected performance for all of 
them. For those not interested in the details of the derivation 
of the performances a summary of results appears at the end 
of the section. 

A. Algorithms Having Different Expected Performance 

Algorithm $a_1$ explores the joint strategies $(1, 1), \cdots, (1, m)$
and algorithm $a_2$ explores the joint strategies
$(1, 1), \cdots, (m, 1)$, i.e., $a_1$ exhausts opponent responses
to $\underline{x} = 1$ while $a_2$ only samples one opponent response to
each of its m possible strategies. For the champion-selection
rule, $A(d_m)$, we apply the Bayes optimal rule: select the
strategy $\underline{x}$ that has the highest expected $g(\underline{x})$ when averaged
uniformly over payoff functions consistent with the observed
population.

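To start, we determine the expected performance of an algorithm
that does not have the benefit of knowing any opponent
responses. In this case we average the performance, $g(\underline{x})$, for
any element $\underline{x}$, over all $|Y|^{\underline{l}\bar{l}}$ functions. 8 We note that for any
given agent strategy $\underline{x}$, the $|Y|^{\bar{l}}$ possible function values at
the joint strategies $(\underline{x}, \cdot)$ are replicated $|Y|^{\underline{l}\bar{l}} / |Y|^{\bar{l}} = |Y|^{(\underline{l}-1)\bar{l}}$
times. The number of times that a $g(\underline{x})$ value of $1 - i/|Y|$ is
attained in the first $|Y|^{\bar{l}}$ distinct values is $(i + 1)^{\bar{l}} - i^{\bar{l}}$. Thus
the average $g(\underline{x})$ value, which we denote $\langle g \rangle$, is

$$\langle g \rangle = \sum_{i=0}^{|Y|-1} \left(1 - \frac{i}{|Y|}\right) \pi_{\bar{l}}(i)$$

where $\pi_{\bar{l}}(i) = [(i + 1)/|Y|]^{\bar{l}} - [i/|Y|]^{\bar{l}}$. This average value
is obtained for all strategies $\underline{x}$. In the continuum limit where
$|Y| \to \infty$ the expected value of $g$ is simply

$$\langle g \rangle = 1/(1 + \bar{l}).$$

8 Recall that $|X| = \underline{l}\,\bar{l}$.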


This serves as a baseline for comparison; any algorithm that 
samples some opponent responses has to do better than this. 
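The continuum value of the baseline can be checked in one line. The sketch below assumes, as the uniform average over payoff functions implies in the continuum limit, that the $\bar{l}$ payoffs $f(\underline{x}, \cdot)$ are independent and uniformly distributed on $[0, 1]$:

```latex
P\big(g(\underline{x}) > t\big) = (1 - t)^{\bar{l}},
\qquad
\langle g \rangle
  = \int_0^1 P\big(g(\underline{x}) > t\big)\, dt
  = \int_0^1 (1 - t)^{\bar{l}}\, dt
  = \frac{1}{1 + \bar{l}} .
```

The first identity holds because the minimum exceeds $t$ only if all $\bar{l}$ payoffs do.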

Next we consider the algorithm $a_1$, which exhaustively
explores all opponent responses to $\underline{x} = 1$. Because $m = \bar{l}$
there are $|Y|^{\bar{l}}$ possible $d_m$ that this algorithm might see. For
each of these sample sets, $d_m$, we need to determine $g(1)$, and
the average $g$ values for each of the other strategies $\underline{x} \neq 1$. This
average is taken over the $|Y|^{(\underline{l}-1)\bar{l}}$ functions that are consistent
with $d_m$. Of course we have $g(1) = \min d_m^y$ and the expected
$g(\underline{x})$ values for $\underline{x} \neq 1$ are all equal to $\langle g \rangle$ (since we have
no samples from any strategies $\underline{x} \neq 1$). Since the champion-choosing
rule maximizes the expected value of $g$ the expected
performance of $a_1$ for this sample set is $\max(\min d_m^y, \langle g \rangle)$.
Averaged over all functions the expected performance of $a_1$
is

$$\langle g \rangle_1 = \frac{1}{|Y|^{\bar{l}}} \sum_{d_m^y} \max\left(\min d_m^y, \langle g \rangle\right)$$


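where the sum is over all $|Y|^{\bar{l}}$ possible samples. Converting
the sum over all samples into a sum over the minimum value
of the population we find

$$\langle g \rangle_1 = \sum_{i=0}^{|Y|-1} \max\left(1 - \frac{i}{|Y|}, \langle g \rangle\right) \pi_{\bar{l}}(i)
= \sum_{i=0}^{\lceil |Y|(1 - \langle g \rangle) \rceil - 1} \left(1 - \frac{i}{|Y|}\right) \pi_{\bar{l}}(i)
+ \langle g \rangle \sum_{i=\lceil |Y|(1 - \langle g \rangle) \rceil}^{|Y|-1} \pi_{\bar{l}}(i).$$

If we define $i_g = \lceil |Y|(1 - \langle g \rangle) \rceil$ then we obtain

$$\langle g \rangle_1 = \sum_{i=0}^{i_g - 1} \left(1 - \frac{i}{|Y|}\right) \pi_{\bar{l}}(i) + \langle g \rangle \sum_{i=i_g}^{|Y|-1} \pi_{\bar{l}}(i).$$
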
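In the continuum limit we have

$$\langle g \rangle_1 = (1 - \langle g \rangle)^{\bar{l}} \, \frac{1 + \bar{l} \langle g \rangle}{1 + \bar{l}} + \langle g \rangle \left(1 - (1 - \langle g \rangle)^{\bar{l}}\right)
= \frac{1}{1 + \bar{l}} \left[ 1 + \left(1 + \bar{l}^{-1}\right)^{-(\bar{l} + 1)} \right]$$

where we have recalled the expected value $\langle g \rangle = 1/(1 + \bar{l})$.
We note that as $\bar{l} \to \infty$ the performance of algorithm $a_1$ is
$(1 + e^{-1})$ times that of $\langle g \rangle$.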
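The finite-$|Y|$ averages can be evaluated exactly and compared with the continuum expressions. The following sketch assumes the weights $\pi_{\bar{l}}(i) = [(i+1)/|Y|]^{\bar{l}} - [i/|Y|]^{\bar{l}}$ and the closed form $\frac{1}{1+\bar{l}}[1 + (1 + \bar{l}^{-1})^{-(\bar{l}+1)}]$; the function names are hypothetical.

```python
from fractions import Fraction


def pi(l_bar: int, i: int, n_y: int) -> Fraction:
    """pi_l(i): probability that the minimum of l_bar payoffs equals 1 - i/|Y|."""
    return Fraction(i + 1, n_y) ** l_bar - Fraction(i, n_y) ** l_bar


def mean_g(l_bar: int, n_y: int) -> Fraction:
    """Baseline <g>: no opponent response to the strategy has been observed."""
    return sum((1 - Fraction(i, n_y)) * pi(l_bar, i, n_y) for i in range(n_y))


def mean_g_a1(l_bar: int, n_y: int) -> Fraction:
    """<g>_1: keep strategy 1 only if its observed worst case beats <g>."""
    baseline = mean_g(l_bar, n_y)
    return sum(max(1 - Fraction(i, n_y), baseline) * pi(l_bar, i, n_y)
               for i in range(n_y))


def continuum_g_a1(l_bar: int) -> float:
    """Continuum limit of <g>_1 as |Y| grows without bound."""
    return (1 + (1 + 1 / l_bar) ** (-(l_bar + 1))) / (1 + l_bar)


for n_y in (2, 20, 2000):
    print(n_y, float(mean_g(3, n_y)), float(mean_g_a1(3, n_y)))
print("continuum:", 1 / (1 + 3), continuum_g_a1(3))
```

For $|Y| = 2$ and $\bar{l} = 2$ the baseline evaluates to 5/8, which agrees with averaging the $g$ values of the small exhaustive example over its 16 functions.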
The analysis of algorithm $a_2$ is slightly more complex. In
this case each game occurs at a different $\underline{x}$. For any given
observed set of samples the optimal strategy for the agent is
to choose that $\underline{x}^*$ which has the largest fitness observed in the
population. With this insight, we observe that when summing
over all functions, there are $|Y|^{\bar{l}-1}$ possible completions to
$\max d_m^y$ for the remaining $\bar{l} - 1$ unobserved responses to $\underline{x}^*$.
We must take the minimum over these possible completions to
determine the expected value of $g$. Thus, the expected payoff
for algorithm $a_2$ when averaging over all functions is

$$\langle g \rangle_2 = \frac{1}{|Y|^{\bar{l}}} \sum_{d_m^y} \sum_{i=0}^{|Y|-1} \min\left(\max d_m^y, 1 - \frac{i}{|Y|}\right) \pi_{\bar{l}-1}(i).$$

We proceed in the same fashion as above by defining $i_d =
|Y|(1 - \max d_m^y)$ (which depends on $d_m^y$), 9 so that

$$\langle g \rangle_2 = \frac{1}{|Y|^{\bar{l}}} \sum_{d_m^y} \left[ \max d_m^y \sum_{i=0}^{i_d - 1} \pi_{\bar{l}-1}(i) + \sum_{i=i_d}^{|Y|-1} \left(1 - \frac{i}{|Y|}\right) \pi_{\bar{l}-1}(i) \right]$$

$$= \frac{1}{|Y|^{\bar{l}}} \sum_{d_m^y} \left[ \max d_m^y \left(1 - \max d_m^y\right)^{\bar{l}-1} + \sum_{i=i_d}^{|Y|-1} \frac{|Y| - i}{|Y|} \pi_{\bar{l}-1}(i) \right].$$

The sum over populations is now tackled by converting it to a
sum over the $|Y|$ possible values of $\max d_m^y$. The number of
sequences of length $\bar{l}$ having maximum value $j$ is $j^{\bar{l}} - (j - 1)^{\bar{l}}$.
Moreover, if $\max d_m^y = j/|Y|$ then $i_d = |Y| - j$ and so

$$\langle g \rangle_2 = \sum_{j=1}^{|Y|} \pi_{\bar{l}}(j - 1) \left[ \frac{j}{|Y|} \left(1 - \frac{j}{|Y|}\right)^{\bar{l}-1} + \sum_{i=|Y|-j}^{|Y|-1} \left(1 - \frac{i}{|Y|}\right) \pi_{\bar{l}-1}(i) \right].$$

The continuum limit in this case is found as

$$\langle g \rangle_2 = \bar{l} \int_0^1 dy \, y^{\bar{l}-1} \left( y (1 - y)^{\bar{l}-1} + (\bar{l} - 1) \int_{1-y}^{1} dy' \, (1 - y') \, y'^{\,\bar{l}-2} \right)$$

$$= \int_0^1 dy \, y^{\bar{l}} (1 - y)^{\bar{l}-1} + \int_0^1 dy \, y^{\bar{l}-1} - \int_0^1 dy \, y^{\bar{l}-1} (1 - y)^{\bar{l}-1}$$

$$= B(\bar{l} + 1, \bar{l}) + 1/\bar{l} - B(\bar{l}, \bar{l})$$

where $B(x, y)$ is the beta function defined by $B(x, y) =
\Gamma(x)\Gamma(y)/\Gamma(x + y)$. For large $\bar{l}$ the Beta functions almost
cancel and the expected performance for $a_2$ varies as $1/\bar{l}$,
which is only slightly better than the performance of the
algorithm that does not have access to any training data.

9 There is no need to take the ceiling because $i_d$ is automatically an integer.

Summary of results:

For reference we summarize these results, and the conditions
under which the results have been derived. We have
considered a two player game where the player and opponent
each have $\bar{l}$ possible strategies available to them. Training
algorithms sample $m = \bar{l}$ distinct games and their fitnesses.
Fitness values lie uniformly between 0 and 1, and measure
the normalized ranking so that, e.g., the configuration having
fitness 1/2 is fitter than half of all possible fitnesses.

Performance is measured by the maximin criterion (i.e.,
the worst-case performance of the strategy against an opponent),
and averaged over all possible fitness functions. Three
algorithms were considered: random search which randomly
selects $\bar{l}$ distinct training games; algorithm $a_1$ which applies
a single given strategy and determines the opponent's best
response to that strategy; and algorithm $a_2$ which samples a
single opponent response to all its $\bar{l}$ possible strategies. In
all cases the champion strategy is selected with the Bayes
optimal rule which chooses the strategy having the highest
expected performance given the observed data. The expected
performance in each of these algorithms is:

$a_1$: $\frac{1}{1 + \bar{l}} \left[ 1 + \left(1 + \bar{l}^{-1}\right)^{-(\bar{l}+1)} \right]$

$a_2$: $B(\bar{l} + 1, \bar{l}) + 1/\bar{l} - B(\bar{l}, \bar{l})$

Random: $\frac{1}{1 + \bar{l}}$

where $B(x, y) = \Gamma(x)\Gamma(y)/\Gamma(x + y)$.

Figure 1 plots the expected performance of $a_1$, $a_2$, and
random search as a function of $\bar{l}$ (recall that $m = \underline{l} = \bar{l}$).
Algorithm $a_1$ outperforms algorithm $a_2$ on average for all
values of $\bar{l}$.

Fig. 1. Expected performance of algorithm $a_1$ (indicated as $\langle g \rangle_1$), which exhaustively enumerates the opponent's responses to a particular strategy, and
algorithm $a_2$ (indicated as $\langle g \rangle_2$), which samples only one opponent response to each strategy. For comparison, we also plot $\langle g \rangle$, which is the expected
performance of an algorithm that does no sampling of opponent responses.

B. Performance Difference

Though $a_1$ outperforms $a_2$ on average, it is interesting
to determine the fraction of functions where $a_1$ will
perform no worse than $a_2$. This fraction is given by
$|Y|^{-\underline{l}\bar{l}} \sum_f \theta(\mathrm{perf}_1(f) - \mathrm{perf}_2(f))$ where $\mathrm{perf}_1(f)$ is the
performance of algorithm $a_1$ on payoff function $f$, $\mathrm{perf}_2(f)$ is the
performance of algorithm $a_2$ on the same $f$, and $\theta$ is a step
function defined as $\theta(x) = 1$ if $x \ge 0$ and $\theta(x) = 0$ otherwise.
The Bayes optimal payoff for $a_1$ for any given payoff function
$f$ is 10

$$\mathrm{perf}_1(f) = \begin{cases} \min_{\bar{x}} f(1, \bar{x}) & \text{if } \min_{\bar{x}} f(1, \bar{x}) > \langle g \rangle \\ \min_{\bar{x}} f(2, \bar{x}) & \text{otherwise} \end{cases}$$

Similarly, the performance of algorithm $a_2$ is given by

$$\mathrm{perf}_2(f) = \min_{\bar{x}} f(\underline{x}_2, \bar{x})$$

where $\underline{x}_2$ is the strategy having the highest fitness observed
in the sample games $d_m$.

To determine the performance of the algorithms for any
given $f$ we divide $f$ into its relevant and irrelevant components
as follows:

$$\frac{j_1}{|Y|} = f(1, 1), \qquad \frac{k_1}{|Y|} = \min\{f(1, \bar{x}) \mid \bar{x} \neq 1\},$$

$$\frac{j_2}{|Y|} = f(2, 1), \qquad \frac{k_2}{|Y|} = \min\{f(2, \bar{x}) \mid \bar{x} \neq 1\},$$

$$\frac{n}{|Y|} = \max\{f(\underline{x}, 1) \mid \underline{x} \neq 1, 2\}, \qquad \frac{p}{|Y|} = \min\{f(\underline{x}_2, \bar{x}) \mid \bar{x} \neq 1\}.$$

In the definition of $p$, $\underline{x}_2$ is the strategy chosen by $a_2$; if $a_2$
does not choose strategy $\underline{x}_2 = 1$ or 2, the specific value of $\underline{x}_2$
is irrelevant. Given these definitions, the performances of the
two algorithms are

$$\mathrm{perf}_1(f) = \frac{1}{|Y|} \begin{cases} \min(j_1, k_1) & \text{if } \min(j_1, k_1) > |Y| \langle g \rangle \\ \min(j_2, k_2) & \text{otherwise} \end{cases}$$

and

$$\mathrm{perf}_2(f) = \frac{1}{|Y|} \begin{cases} \min(j_1, k_1) & \text{if } \max(j_1, j_2, n) = j_1 \\ \min(j_2, k_2) & \text{if } \max(j_1, j_2, n) = j_2 \\ \min(n, p) & \text{otherwise} \end{cases}$$

respectively.

10 We have arbitrarily assumed that $a_1$ will select strategy 2 if it does not
select strategy 1. This choice has no bearing on the result.

In summing the above expressions over $f$ we replace the
sum over $f$ with a sum over $j_1$, $j_2$, $k_1$, $k_2$, $n$, and $p$ using
the appropriate multiplicities. The resulting sums are then
converted to integrals in the continuum limit and evaluated
by Monte Carlo. Details are presented in Appendix A.

The results are shown in Figure 2, which plots the fraction
of functions for which $\mathrm{perf}_1 \ge \mathrm{perf}_2$. This plot was generated
using $10^7$ Monte Carlo samples per $\bar{l}$ value.
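A direct simulation offers an independent check of both the expected performances and the fraction of functions on which $a_1$ does no worse than $a_2$. The sketch below is not the reduced-variable procedure of Appendix A: it assumes the continuum limit (payoffs drawn independently and uniformly from $[0, 1]$), square games with $\underline{l} = \bar{l} = l \ge 2$, and the closed forms $\frac{1}{1+l}[1 + (1 + l^{-1})^{-(l+1)}]$ and $B(l+1, l) + 1/l - B(l, l)$. All function names are hypothetical.

```python
import math
import random


def expected_baseline(l: int) -> float:
    """<g> = 1 / (1 + l): no opponent responses observed."""
    return 1.0 / (1 + l)


def expected_a1(l: int) -> float:
    """Continuum expected performance of a_1."""
    return (1 + (1 + 1 / l) ** (-(l + 1))) / (1 + l)


def beta(x: float, y: float) -> float:
    """Beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)


def expected_a2(l: int) -> float:
    """Continuum expected performance of a_2."""
    return beta(l + 1, l) + 1 / l - beta(l, l)


def simulate(l: int, samples: int, seed: int = 0):
    """Return (mean perf_1, mean perf_2, fraction with perf_1 >= perf_2)."""
    rng = random.Random(seed)
    baseline = expected_baseline(l)
    total1 = total2 = 0.0
    no_worse = 0
    for _ in range(samples):
        # f[x][r]: payoff when the agent plays x and the opponent replies r.
        f = [[rng.random() for _ in range(l)] for _ in range(l)]
        worst = [min(row) for row in f]
        # a_1 sees every response to strategy 0 and keeps it only if its
        # worst case beats <g>; otherwise it falls back to strategy 1.
        perf1 = worst[0] if worst[0] > baseline else worst[1]
        # a_2 sees response 0 to every strategy and picks the largest payoff.
        champion = max(range(l), key=lambda x: f[x][0])
        perf2 = worst[champion]
        total1 += perf1
        total2 += perf2
        no_worse += perf1 >= perf2
    return total1 / samples, total2 / samples, no_worse / samples


mean1, mean2, fraction = simulate(3, 20000)
print(mean1, expected_a1(3))
print(mean2, expected_a2(3))
print("fraction with perf_1 >= perf_2:", fraction)
```

The fallback to strategy 1 (the second strategy) mirrors the arbitrary choice made for $a_1$ in the text; any other unexplored strategy would give the same averages.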


C. Other Champion-Selection Criteria

We have shown the existence of free lunches for self-play by
constructing a pair of algorithms with differing search rules $a_1$
and $a_2$, but with the same champion-selecting rule (select the
strategy with the highest expected $g(\underline{x})$), and showed different
performance. Unsurprisingly, we can construct algorithms with
different expected performance which have the same search 
rules, but which have different champion-selecting rules. In 
this section we provide a simple example of such a pair of 
algorithms. This should help demonstrate that free lunches are 
a rather common occurrence in self-play settings. 


Each process of the pair we construct uses the same search
rule $a$ (it is not important in the present context what $a$ is), but
different deterministic champion-selecting rules $\underline{A}$. 11 In both
cases a Bayesian estimate based on uniform $P(f)$ and the $d_m$
at hand is made of the expected value of $g(\underline{x}) = \min_{\bar{x}} f(\underline{x}, \bar{x})$
for each $\underline{x}$. Since we strive to maximize the worst possible
payoff from $f$, the optimal champion-selection rule selects the
strategy that maximizes this expected value while the worst
champion-selection rule selects the strategy that minimizes
this value. More formally, if $E(C \mid d_m, a, \underline{A})$ differs for the
two choices of $\underline{A}$, always being higher for one of them, then
$E(C \mid m, a, \underline{A}) = \sum_{d_m} P(d_m \mid a) E(C \mid d_m, a, \underline{A})$ differs for the
two $\underline{A}$. In turn,

$$E(C \mid m, a, \underline{A}) = \sum_{f, C} \left[ C \times P(C \mid f, m, a, \underline{A}) \times P(f) \right]
\propto \sum_{f, C} \left[ C \times P(C \mid f, m, a, \underline{A}) \right]$$

for the uniform prior $P(f)$. Since this differs for the two $\underline{A}$,
so must $\sum_f P(C \mid f, m, a, \underline{A})$.

Let $g(\underline{x})$ be a random variable representing the value of $g(\underline{x})$
conditioned on $d_m$ and $\underline{x}$, i.e., it equals the worst possible
payoff (to the agent) after the agent applies strategy $\underline{x}$ and
the opponent replies. In the example of section V-B we have
$E g(1) = 1/2$ and $E g(2) = 3/4$.

To determine the expected value of $g(\underline{x})$ we need to know
$P(g(\underline{x}) \mid \underline{x}, d_m) = \sum_f P(g(\underline{x}) \mid \underline{x}, d_m, f) P(f)$ for uniform
$P(f)$. Of the entire population $d_m$ only the subset sampled
at $\underline{x}$ is relevant. We assume that there are $k(\underline{x}, d_m) \le m$
such values. 12 Since we are concerned with the worst possible
opponent response let $r(\underline{x}, d_m)$ be the minimal $Y$ value
obtained over the $k(\underline{x}, d_m)$ responses to $\underline{x}$, i.e. $r(\underline{x}, d_m) =
\min_{t :\, d_m^{\underline{x}}(t) = \underline{x}} d_m^y(t)$. Since payoff values are normalized to lie
between 0 and 1, $0 \le r(\underline{x}, d_m) \le 1$. Given $k(\underline{x}, d_m)$ and
$r(\underline{x}, d_m)$, $P(g \mid \underline{x}, d_m)$ is independent of $\underline{x}$ and $d_m$ and so we
indicate the desired probability as $\pi_{k,r}(g)$.

11 The notation $\underline{A}$ is meant to be suggestive of the fact that $\underline{A}(d_m)$ is the
$\underline{x}$ (first) component common to all joint configurations in $A(d_m)$.

12 Of course, we must also have $k(\underline{x}, d_m) \le \bar{l}(\underline{x})$ for all populations $d_m$.

Fig. 2. The fraction of functions in the continuum limit where $\mathrm{perf}_1 \ge \mathrm{perf}_2$. The figure was generated with $10^7$ Monte Carlo samples of the integrand for
each value of $\bar{l}$.

In appendix B we derive the probability $\pi_{k,r}$ in the case
where all $Y$ values are distinct (we do so because this results
in a particularly simple expression for the expected value of $g$)
and in the case where $Y$ values are not forced to be distinct.
From these densities the expected value of $g(\underline{x})$ can be
determined. In the case where $Y$ values are not forced to be
distinct there is no closed form for the expectation. However,
in the continuum limit where $|Y| \to \infty$ we find (see appendix
C)

$$E(g(\underline{x}) \mid \underline{x}, d_m) = \frac{1 - (1 - r(\underline{x}, d_m))^{\bar{l}(\underline{x}) - k(\underline{x}, d_m) + 1}}{\bar{l}(\underline{x}) - k(\underline{x}, d_m) + 1} \qquad (5)$$

where we have explicitly noted that both $k$ and $r$ depend both
on the strategy $\underline{x}$ as well as the training population $d_m$. As
shorthand we define $C_m(\underline{x}) = E(g(\underline{x}) \mid \underline{x}, d_m)$.

The best strategy given the training population is the
deterministic choice $\underline{A}_{best}(d_m) = \arg\max_{\underline{x}} C_m(\underline{x})$ and the worst
strategy is $\underline{A}_{worst}(d_m) = \arg\min_{\underline{x}} C_m(\underline{x})$. In the example of
section V-B with the population of size 2, $\underline{A}_{best}(d_2) = 2$ and
$\underline{A}_{worst}(d_2) = 1$.

As long as $C_m(\underline{x})$ is not constant (which will usually be
the case since the $r$ values will differ) the performances of
the two champion-selecting rules will differ, and the expected
performance of $\underline{A}_{best}$ will be superior.
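The two champion-selecting rules can be sketched directly from the continuum expression (5), $E(g \mid \underline{x}, d_m) = [1 - (1 - r)^{\bar{l} - k + 1}] / (\bar{l} - k + 1)$. In the sketch below the function names are hypothetical, the number of opponent responses is taken to be the same for every strategy, and a strategy with no observations is assigned $r = 1$, a convention assumed here because it reproduces the baseline $1/(\bar{l} + 1)$.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[int, int, float]  # (agent strategy, opponent response, payoff)


def expected_g(r: float, k: int, l_bar: int) -> float:
    """Continuum estimate (5) of the worst-case payoff of one strategy.

    r: smallest payoff observed for the strategy, k: number of observed
    responses, l_bar: total number of opponent responses.
    """
    remaining = l_bar - k + 1
    return (1 - (1 - r) ** remaining) / remaining


def score_strategies(population: List[Sample], strategies: Iterable[int],
                     l_bar: int) -> Dict[int, float]:
    """Return C_m(x) for every agent strategy x."""
    scores = {}
    for x in strategies:
        seen = [y for (agent, _response, y) in population if agent == x]
        r = min(seen) if seen else 1.0  # assumed convention for k = 0
        scores[x] = expected_g(r, len(seen), l_bar)
    return scores


def a_best(scores: Dict[int, float]) -> int:
    """Champion with the highest estimated worst-case payoff."""
    return max(scores, key=scores.get)


def a_worst(scores: Dict[int, float]) -> int:
    """Champion with the lowest estimated worst-case payoff."""
    return min(scores, key=scores.get)


# The population d_2 = {(1, 2; 1/2), (2, 2; 1)} of the exhaustive example.
d2 = [(1, 2, 0.5), (2, 2, 1.0)]
scores = score_strategies(d2, strategies=(1, 2), l_bar=2)
print(scores, a_best(scores), a_worst(scores))
```

The continuum scores differ numerically from the discrete values $E g(1) = 1/2$ and $E g(2) = 3/4$ of the two-valued example, but the ranking of the two strategies, and hence the champions chosen by the two rules, is the same.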


D. Better Training Algorithms 

In the previous sections we constructed Bayes-optimal algo- 
rithms in limited settings by using specially constructed deter- 
ministic rules a and A. This alone is sufficient to demonstrate 


the availability of free lunches in self-play contexts. However,
we can build on these insights to construct even better (and
even worse) algorithms by also determining (at least partially)
the Bayes-optimal search rule, $(a, \underline{A})$, that builds out the
training set, and selects the champion strategy. That analysis
would parallel the approach taken in [?] to study bandit
problems, and would further increase the performance gap
between the (best, worst) pair of algorithms.


E. The Role of Opponent “Intelligence”

All results thus far have been driven by measuring performance
based on $g(\underline{x}) = \min_{\bar{x}} f(\underline{x}, \bar{x})$. This is a very
pessimistic measure as it assumes that the agent's opponent
is omniscient, and will employ the strategy most detrimental
to the agent. If the opponent is not omniscient and cannot
determine $\bar{x}^* = \arg\min_{\bar{x}} f(\underline{x}, \bar{x})$, how does this affect the
availability of free lunches?

Perhaps the simplest way to quantify the intelligence of
the opponent is through the fraction, $\alpha$, of payoff values
known to the opponent. The opponent will use these known
values to estimate its optimal strategy $\bar{x}^*$. The $\alpha = 1$ limit
corresponds to maximal intelligence where the opponent can
always determine $\bar{x}^*$ and, as we have seen, gives free lunches.
In the $\alpha = 0$ limit the opponent can only make random replies,
so that the expected performance of the agent will be the
average over the opponent's possible responses.

One way to approach this problem is to build the opponent's
bounded intelligence into the agent's payoff function $g$ and
proceed as we did in the omniscient case. If $|X|$ is the number
of joint strategies, then there are $\binom{|X|}{\alpha |X|}$ possible subsets of
joint strategies of size $\alpha |X|$. 13 We indicate the list of possible
subsets as $\mathcal{S}(X, \alpha |X|)$, and a particular subset by $S_i \in \mathcal{S}$. For
this particular subset, $\bar{x}^*$ is estimated by selecting the best
response out of the $S_i$ payoff values known to the opponent.
Of course, it may be the case that there are no samples in $S_i$
having the agent's strategy $\underline{x}$ and in that case the opponent
can only select a random response. In this case the agent will
obtain the average payoff over the opponent's possible responses.
If we assume that
all subsets of size $\alpha |X|$ are equally likely, then the agent's
payoff function against an opponent with bounded intelligence
is given by

$$g_\alpha(\underline{x}) = \binom{|X|}{\alpha |X|}^{-1} \sum_{S_i \in \mathcal{S}(X, \alpha |X|)} \; \min_{(\underline{x}, \bar{x}) \in S_i} f(\underline{x}, \bar{x}).$$

13 We assume that $\alpha$ is an integral multiple of $1/|X|$.


This generalization reduces to the previously assumed $g$ in the
maximally intelligent $\alpha = 1$ case. In Table II the functions
$g_{1/4}$, $g_{2/4}$, $g_{3/4}$, and $g$ are listed for the example of
section V-B. As expected the payoff to the agent increases
with decreasing $\alpha$ (a less intelligent opponent). However, we
also observe that for the same population, $d_2$, the average
$[g(\underline{x} = 1) \;\; g(\underline{x} = 2)]$ values are $[5/8 \;\; 7/8]$ for $\alpha = 1/4$,
$[29/48 \;\; 41/48]$ for $\alpha = 2/4$, $[9/16 \;\; 13/16]$ for $\alpha =
3/4$, and $[1/2 \;\; 3/4]$ for $\alpha = 4/4$. For this population, $d_2$,
$(a, \underline{A}_{best})$ continues to beat $(a, \underline{A}_{worst})$ by the same amount
independent of $\alpha$.
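The averages quoted for the population $d_2$ can be reproduced by enumerating every equally likely set of joint strategies known to the opponent. The sketch below uses the two-strategy, two-response example with $Y = \{1/2, 1\}$. It assumes that the opponent plays the lowest known payoff among its known responses to the agent's strategy and replies uniformly at random when it knows none; the names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

HALF, ONE = Fraction(1, 2), Fraction(1)
JOINT = [(1, 1), (1, 2), (2, 1), (2, 2)]  # (agent strategy, opponent response)


def g_alpha(f, x, known_count):
    """Agent payoff for strategy x, averaged over all sets of known_count
    joint strategies whose payoffs are known to the opponent."""
    responses = [f[(x, r)] for r in (1, 2)]
    total, n_subsets = Fraction(0), 0
    for subset in combinations(JOINT, known_count):
        known = [f[joint] for joint in subset if joint[0] == x]
        # Best known reply if any payoff for x is known, else a random reply.
        total += min(known) if known else sum(responses) / len(responses)
        n_subsets += 1
    return total / n_subsets


# Payoff functions consistent with d_2 = {(1, 2; 1/2), (2, 2; 1)}.
D2 = {(1, 2): HALF, (2, 2): ONE}
consistent = []
for values in product([HALF, ONE], repeat=len(JOINT)):
    f = dict(zip(JOINT, values))
    if all(f[joint] == y for joint, y in D2.items()):
        consistent.append(f)


def average_g_alpha(known_count):
    """Average [g_alpha(1), g_alpha(2)] over the consistent payoff functions."""
    return tuple(sum(g_alpha(f, x, known_count) for f in consistent)
                 / len(consistent) for x in (1, 2))


for known_count in (1, 2, 3, 4):  # alpha = known_count / 4
    g1, g2 = average_g_alpha(known_count)
    print(known_count, g1, g2, g2 - g1)
```

The gap between the two strategies is 1/4 for every $\alpha$, which is the sense in which the best rule beats the worst rule by the same amount for this population.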


VII. Conclusions 

We have introduced a general framework for analyzing NFL
issues in a variety of contexts. When applied to self-play we
have proven the existence of pairs of algorithms in which one
is superior for all possible joint payoff functions $f$. This result
stands in marked contrast to similar analyses for optimization
in non-self-play settings. Basically, the result arises because
under a maximin criterion the sum over all payoff functions
$f$ is not equivalent to a sum over all functions $\min_{\bar{x}} f(\underline{x}, \bar{x})$.
We have shown that for simple algorithms we can calculate
expected performance over all possible payoff functions and
in some cases determine the fraction of functions where one
algorithm outperforms another. On the other hand, we have
also shown that for the more general biological coevolutionary
settings, where there is no sense of a “champion” like there
is in self-play, NFL still applies.

Clearly we have only begun an analysis of coevolutionary 
and self-play optimization. Many of the same questions posed 
in the traditional optimization setting can be asked in this more 
general setting. Such endeavors may be particularly rewarding 
at this time given the current interest in the use of game theory 
and self-play for multi-agent systems [?]. 

Appendix 

A. Performance Comparison 

In this appendix we evaluate the fraction of functions for
which $a_1$ performs better than or equal to algorithm $a_2$ where $a_1$
and $a_2$ are defined as in Section VI-B.

The function 6(perf 1 (f) - perf 2 {f)) is equal to 1 if 

cdj + eicd 2 + e 2 cdid 2 + efcdi + cd 2 + e 3 cdid 2 


TABLE III 

Multiplicities occurring when converting the sum over $f$ to a 
sum over the allowed values of $j_1$, $j_2$, $k_1$, $k_2$, $n$, and $p$. 

Variable   Multiplicity 
$j_1$      $1$ 
$j_2$      $1$ 
$k_1$      $(|Y| - k_1 + 1)^{l-1} - (|Y| - k_1)^{l-1}$ 
$k_2$      $(|Y| - k_2 + 1)^{l-1} - (|Y| - k_2)^{l-1}$ 
$n$        $n^{l-2} - (n - 1)^{l-2}$ 
$p$        $(|Y| - p + 1)^{l-1} - (|Y| - p)^{l-1}$ 


where $c = (\min(j_1, k_1) > |Y|\langle g \rangle)$, $d_1 = (\max(j_1, j_2, n) = 
j_1)$, $d_2 = (\max(j_1, j_2, n) = j_2)$, $e_1 = (\min(j_1, k_1) > 
\min(j_2, k_2))$, $e_2 = (\min(j_1, k_1) > \min(n, p))$, and $e_3 = 
(\min(j_2, k_2) > \min(n, p))$. In the above Boolean expression 
we have used the condensed notation $ab = a \wedge b$, $a + b = a \vee b$, 
and $\bar{a} = \neg a$. It is convenient to factor the Boolean expression 
as 

$$c\,(d_1 + e_1 d_2 + e_2 \bar{d}_1 \bar{d}_2) + \bar{c}\,(\bar{e}_1 d_1 + d_2 + e_3 \bar{d}_1 \bar{d}_2).$$

To give the fraction of functions where $a_1$ performs better 
than $a_2$, this expression is to be summed over $j_1$, $j_2$, $k_1$, $k_2$, 
$n$, and $p$ with appropriate multiplicities. The multiplicities are 
given in Table III. 

In the continuum limit this sum becomes the integral 

$$\int_0^1 dj_1 \int_0^1 dj_2 \int_0^1 dk_1\, P(k_1) \int_0^1 dk_2\, P(k_2) \int_0^1 dn\, P(n) \int_0^1 dp\, P(p)\, \left\{ c\,(d_1 + e_1 d_2 + e_2 \bar{d}_1 \bar{d}_2) + \bar{c}\,(\bar{e}_1 d_1 + d_2 + e_3 \bar{d}_1 \bar{d}_2) \right\}$$

where $P(k_1) = (l-1)(1 - k_1)^{l-2}$, $P(k_2) = (l-1)(1 - k_2)^{l-2}$, 
$P(n) = (l-2)\,n^{l-3}$, $P(p) = (l-1)(1 - p)^{l-2}$, and condition $c$ 
is modified to $\min(j_1, k_1) > \langle g \rangle$. Though this integral is 
difficult to evaluate analytically, it is straightforward to evaluate 
by Monte Carlo importance sampling of $(j_1, j_2, k_1, k_2, n, p)$ 
using the respective probability distributions. Samples from 
$P(u) = q(1 - u)^{q-1}$ are obtained by sampling values $v$ from 
$U(0, 1)$ and transforming so that $u = 1 - v^{1/q}$; samples from 
$P(w) = q\,w^{q-1}$ are obtained via $w = v^{1/q}$. 
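
The two inverse-transform rules just described can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the parameter `l_x` (standing for the number of opponent responses $l$, assumed greater than 2) are hypothetical, and $j_1$ and $j_2$ are drawn uniformly because they carry no weighting density in the integral.

```python
import random

def sample_decreasing(q, rng):
    """Draw u in [0, 1] with density q * (1 - u)**(q - 1), via u = 1 - v**(1/q)."""
    return 1.0 - rng.random() ** (1.0 / q)

def sample_increasing(q, rng):
    """Draw w in [0, 1] with density q * w**(q - 1), via w = v**(1/q)."""
    return rng.random() ** (1.0 / q)

def sample_point(l_x, rng):
    """One draw of (j1, j2, k1, k2, n, p) from the product density; needs l_x > 2."""
    return (
        rng.random(),                     # j1 ~ U(0, 1)
        rng.random(),                     # j2 ~ U(0, 1)
        sample_decreasing(l_x - 1, rng),  # k1 ~ (l-1)(1-k1)^(l-2)
        sample_decreasing(l_x - 1, rng),  # k2 ~ (l-1)(1-k2)^(l-2)
        sample_increasing(l_x - 2, rng),  # n  ~ (l-2) n^(l-3)
        sample_decreasing(l_x - 1, rng),  # p  ~ (l-1)(1-p)^(l-2)
    )

rng = random.Random(0)
draws = [sample_decreasing(4, rng) for _ in range(100_000)]
print(sum(draws) / len(draws))  # should be close to 1 / (q + 1) = 0.2
```

A Monte Carlo estimate of the integral is then the average, over many calls to `sample_point`, of the 0/1 value of the Boolean expression evaluated at each sampled point.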


B. Determination of $\pi_{k,r}(g)$: distinct Y 

To determine $\pi_{k,r}(g)$ we first consider the case where all 
Y values are distinct and then consider the possibility of 
duplicate Y values. Though we only present the non-distinct 
case in the main text, we derive the distinct Y case here because 
we can obtain a closed-form expression for the probability and 
because it serves as a simpler introduction to the case of 
non-distinct Y. 

To derive the result we generalize from a concrete example. 
Consider the case where $|Y| = 10$, $l(x) = 5$, and $k = 3$. A 
particular instantiation is presented in Figure 3. In this case 
$r = 4/10$, which is not the true minimum for responses 
to $x$. The probability that $r$ is the true minimum is simply 
$k/l(x)$. If $r$ is not the true minimum then $P(g|d_m)$ is found 
as follows. $P(g = 1/10 \,|\, d_m)$ is the fraction of functions 
TABLE II 

Exhaustive enumeration of all 16 possible agent payoffs, $g^{\alpha}(x = 1)$, $g^{\alpha}(x = 2)$, for boundedly intelligent opponents having 
intelligence parameter $\alpha = 1/4$, $\alpha = 2/4$, $\alpha = 3/4$, and $\alpha = 4/4$. See Table I for the corresponding $f$ functions and for the $\alpha = 1$ $g$ 
function. The payoff functions labeled in bold are those consistent with the population $d_2 = \{(1, 2; 1/2), (2, 2; 1)\}$. 


Y            1/10    2/10    3/10    4/10    5/10    6/10    7/10    8/10    9/10    10/10
f(x, ·)                      *       *               *                       *
d^y_m at x                           *               *                       *
P(g | d_m)   6/21    5/21    4/21    6/21    0       0       0       0       0       0

Fig. 3. Row 1 indicates the Y values obtainable on a particular payoff function $f$ for each of the $l(x) = 5$ possible opponent responses. Row 2 gives the 
Y values actually observed during the training period. Row 3 gives the probabilities of $g$ assuming a uniform probability density across the $f$ which are 
consistent with $d_m$. The expected value of $g$ under $P(g|d_m)$ is 2.48/10. 


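containing Y values at $\{1/10\} \cup d^y_m$.$^{14}$ Since the total number 
of possibilities consistent with the data is $\binom{|Y| - k}{l(x) - k}$, this fraction 
is $\binom{|Y| - k - 1}{l(x) - k - 1} \big/ \binom{|Y| - k}{l(x) - k}$. Similarly, 
$P(g = 2/10 \,|\, d_m)$ is $\binom{|Y| - k - 2}{l(x) - k - 1} \big/ \binom{|Y| - k}{l(x) - k}$ because we know 
that the function cannot contain a sample having fitness less 
than 2/10. 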
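
The counting argument for the concrete example can be checked by enumerating every completion of the observed data. In the sketch below, Y values are represented by integers 1..10 standing for 1/10..10/10. The observed set {4, 6, 9} is an illustrative assumption: only its size k = 3 and its minimum r = 4/10 affect the result. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def distinct_posterior(num_y, num_responses, observed):
    """P(g | d_m) when all Y values of f(x, .) are distinct.

    num_y is |Y|, num_responses is l(x), and observed is the set of the
    k integer Y values seen during training.
    """
    unobserved = [y for y in range(1, num_y + 1) if y not in observed]
    b = num_responses - len(observed)           # responses not yet sampled
    counts = {}
    for extra in combinations(unobserved, b):   # every consistent completion
        g = min(set(observed) | set(extra))     # true minimum for this f
        counts[g] = counts.get(g, 0) + 1
    total = comb(len(unobserved), b)
    return {g: Fraction(c, total) for g, c in sorted(counts.items())}

posterior = distinct_posterior(10, 5, {4, 6, 9})
print(posterior)                                   # probabilities for g = 1..4 (in tenths)
print(sum(g * p for g, p in posterior.items()))    # 52/21, in units of 1/10
```

The enumeration gives 6/21, 5/21, 4/21, and 6/21 for g = 1/10 through 4/10, matching the binomial ratios above and the third row of Figure 3.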

Thus, in the general case, we have 

$$\pi_{k,r}(g) = \binom{a}{b}^{-1} \left\{ \theta(r - g) \binom{a - |Y|g}{b - 1} + \delta_{g,r} \binom{a - |Y|r + 1}{b} \right\}$$

where $a = |Y| - k$, $b = l(x) - k$, $\theta(x) = 1$ iff $x > 0$, and 
$\delta_{g,r} = 1$ iff $g = r$. Since it is easily verified that 

$$\binom{a - |Y|r + 1}{b} = \binom{a}{b} - \sum_{g'=1}^{|Y|r - 1} \binom{a - g'}{b - 1}$$

this probability is normalized correctly. The expected value of 
$g$ is therefore 

$$E(g|d_m) = \frac{1}{|Y|} \binom{a}{b}^{-1} \left\{ \sum_{g'=1}^{|Y|r - 1} g' \binom{a - g'}{b - 1} + |Y|r \binom{a - |Y|r + 1}{b} \right\}.$$

Evaluating this sum we find 

$$E(g|d_m) = \frac{1}{|Y|} \, \frac{(a + 1)^{\underline{b+1}} - (a + 1 - |Y|r)^{\underline{b+1}}}{(b + 1)\, a^{\underline{b}}}$$

where the falling power, $a^{\underline{b}}$, is defined by $a^{\underline{b}} = a(a - 1)(a - 
2) \cdots (a - b + 1)$. For the case at hand where $|Y| = 10$, $l(x) = 
5$, and $k = 3$ we have $a = 7$ and $b = 2$. Since $r = 4/10$ the 
expected value is $E(g|d_m) = \frac{1}{10} (8^{\underline{3}} - 4^{\underline{3}}) / (3 \cdot 7^{\underline{2}}) = \frac{1}{10} \cdot \frac{52}{21} \approx 
2.48/10$. 


C. Determination of $\pi_{k,r}(g)$: non-distinct Y 

In Figure 4 we present another example where $l(x) = 5$, 
$k = 3$, and $r = 4/10$. In this case, however, there are duplicate 
Y values. The total number of functions consistent with the 
data is $|Y|^{l(x)-k} = |Y|^b$. In this case it is easiest to begin 
the analysis with the case $g = r$. Counting Y values in units of 
$1/|Y|$, so that the largest value is $|Y|$, the number of functions 
having the minimum of the remaining $b$ points equal to $|Y|$ 
is 1. Similarly, the number of functions having a minimum 
value of $(|Y| - 1)$ is $2^b - 1$. Here $2^b$ counts the number of functions 
where the $b$ function values can assume one of $|Y|$ or $|Y| - 1$. 
The $-1$ accounts for the fact that 1 of these functions has a 
minimum value of $|Y|$ and not $|Y| - 1$. Generally, the number 
of functions having a minimum value of $r'$ is $(|Y| - |Y|r' + 
1)^b - (|Y| - |Y|r')^b$. All $r' \geq r$ will result in the minimal 
observed value $r$ so that the total number of functions having 
an observed minimum of $r$ is 

$$\sum_{r'=r}^{1} \left[ (|Y| - |Y|r' + 1)^b - (|Y| - |Y|r')^b \right] = (|Y| - |Y|r + 1)^b,$$

where $r'$ advances in steps of $1/|Y|$ and the sum telescopes. 
Thus the probability of $g = r$ is 

$$\pi_{k,r}(g = r) = |Y|^{-b} (|Y| - |Y|r + 1)^b.$$

We turn now to determining the probabilities where $g < r$. 
Of the $b$ remaining Y values the probability that the 
minimum is $g$ is 

$$\pi_{k,r}(g) = |Y|^{-b} \left\{ (|Y| - |Y|g + 1)^b - (|Y| - |Y|g)^b \right\}.$$

Combining these results we obtain the final result 

$$\pi_{k,r}(g) = \theta(r - g) \left\{ \left(1 - g + \frac{1}{|Y|}\right)^b - (1 - g)^b \right\} + \delta_{r,g} \left(1 - r + \frac{1}{|Y|}\right)^b.$$
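
The non-distinct result can be checked against direct enumeration of all $|Y|^b$ completions. The following sketch uses integer Y values 1..|Y| (standing for 1/|Y|..1) and hypothetical function names; it compares the brute-force counts with the combined expression for $\pi_{k,r}(g)$ in the example with |Y| = 10, b = 2, and r = 4/10.

```python
from fractions import Fraction
from itertools import product

def nondistinct_posterior(num_y, b, r):
    """Brute-force P(g | d_m) when Y values may repeat.

    num_y is |Y|, b is the number of unsampled responses, and r is the
    observed minimum given as an integer in 1..num_y.
    """
    counts = {}
    for extra in product(range(1, num_y + 1), repeat=b):
        g = min((r,) + extra)                   # true minimum for this completion
        counts[g] = counts.get(g, 0) + 1
    return {g: Fraction(c, num_y ** b) for g, c in sorted(counts.items())}

def closed_form(num_y, b, r, g):
    """Combined expression for pi_{k,r}(g), with g and r as integers."""
    if g < r:
        return Fraction((num_y - g + 1) ** b - (num_y - g) ** b, num_y ** b)
    if g == r:
        return Fraction((num_y - r + 1) ** b, num_y ** b)
    return Fraction(0)

posterior = nondistinct_posterior(10, 2, 4)
print(posterior)
print(all(p == closed_form(10, 2, 4, g) for g, p in posterior.items()))  # True
```

The enumeration yields 19/100, 17/100, 15/100, and 49/100 for g = 1/10 through 4/10, the values listed in the third row of Figure 4.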


14 By $d^y_m$ we mean the set of Y values sampled at $x$. 
Y            1/10    2/10    3/10    4/10    5/10    6/10    7/10    8/10    9/10    10/10
f(x, ·)                      *       *               **                      *
d^y_m at x                           *               **
P(g | d_m)   19/100  17/100  15/100  49/100  0       0       0       0       0       0

Fig. 4. Row 1 indicates the Y values obtainable on a particular payoff function $f$ for each of the $l(x) = 5$ possible opponent responses. Row 2 gives the 
Y values actually observed during the training period. Row 3 gives the probabilities of $g$ assuming a uniform probability density across the $f$ which are 
consistent with $d_m$. Note that unlike Fig. 3 there are some duplicate Y values. The expected value of $g$ under $P(g|d_m)$ is 2.94/10. 


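Given $\pi_{k,r}(g)$ the expectation value of $g$ is found as 

$$E(g|d_m) = \sum_{g=1/|Y|}^{r - 1/|Y|} g \left\{ \left(1 - g + \frac{1}{|Y|}\right)^b - (1 - g)^b \right\} + r \left(1 - r + \frac{1}{|Y|}\right)^b = \frac{1}{|Y|} \sum_{r'=1/|Y|}^{r} \left(1 - \left(r' - \frac{1}{|Y|}\right)\right)^b$$

where we have cancelled appropriate terms in the telescoping 
sum. If we define $S_k(n) \equiv \sum_{i=1}^{n} i^k$ then we can evaluate the 
last sum to find 

$$E(g|d_m) = |Y|^{-(b+1)} \left\{ S_b(|Y|) - S_b(|Y| - |Y|r) \right\}.$$

Though there is no closed form expression for $S_k(n)$, a 
recursive expansion of $S_k(n)$ in terms of $S_j(n)$ for $j < k$ 
is 

$$S_k(n) = \frac{1}{k + 1} \left\{ (n + 1)^{k+1} - 1 - \sum_{j=0}^{k-1} \binom{k + 1}{j} S_j(n) \right\}.$$

The recursion is based upon $S_0(n) = n$. 

In the concrete case above where $|Y| = 10$, $r = 4/10$, and 
$b = 2$ we have $S_2(10) - S_2(6) = 385 - 91 = 294$, so the expected 
value is $\frac{1}{10} \cdot \frac{294}{100} = 2.94/10$. 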
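
The power-sum route to the expectation can be sketched in a few lines. This is an illustration under stated assumptions: the recursion used is the standard binomial identity for sums of powers, which may differ in form from the one the authors had in mind, the result is normalized so that it is a fraction of the Y range, and the function names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def power_sum(k, n):
    """S_k(n) = sum of i**k for i = 1..n, by recursion on lower orders."""
    if k == 0:
        return n                                # base case S_0(n) = n
    lower = sum(comb(k + 1, j) * power_sum(j, n) for j in range(k))
    # (k + 1) * S_k(n) = (n + 1)**(k + 1) - 1 - sum_j C(k + 1, j) * S_j(n)
    return ((n + 1) ** (k + 1) - 1 - lower) // (k + 1)

def expected_g(num_y, b, r):
    """E(g | d_m) for non-distinct Y; r is the observed minimum as an integer in 1..num_y."""
    numerator = power_sum(b, num_y) - power_sum(b, num_y - r)
    return Fraction(numerator, num_y ** (b + 1))

print(power_sum(2, 10), power_sum(2, 6))  # 385 91
print(expected_g(10, 2, 4))               # 147/500, i.e. 2.94/10
```

Exact integer arithmetic is used throughout, so the result can be compared directly with the brute-force expectation of the Figure 4 example.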


D. Continuum Limit 

In the limit where $|Y| \to \infty$ we can approximate the 
expectation $E(g|d_m)$ given by the sum 

$$E(g|d_m) = \frac{1}{|Y|} \sum_{r'=1/|Y|}^{r} \left(1 - \left(r' - \frac{1}{|Y|}\right)\right)^b = \frac{1}{|Y|} \sum_{r'=0}^{r - 1/|Y|} (1 - r')^b$$

by the integral 

$$E(g|d_m) = \int_0^r dr' \, (1 - r')^b = \frac{1}{b + 1} \left\{ 1 - (1 - r)^{b+1} \right\}. \qquad (6)$$

The prediction made by this approximation at $|Y| = 10$, $r = 4/10$, 
and $b = 2$ is 2.61/10 as opposed to the correct result of 
2.94/10. However, had $|Y| = 1000$ and $r = 400/1000$ the accurate 
result would have been 261.65/1000 while the approximation 
gives 261.3/1000. 
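
The quality of the continuum approximation at the two grid sizes quoted above can be reproduced numerically. The sketch below is a minimal check with hypothetical function names; the exact sum includes the grid spacing 1/|Y| so that it is directly comparable with the integral.

```python
def exact_expectation(num_y, b, r):
    """(1/|Y|) * sum of (1 - i/|Y|)**b for i = 0..r-1; r is an integer count."""
    return sum((1 - i / num_y) ** b for i in range(r)) / num_y

def continuum_expectation(b, r_fraction):
    """Integral of (1 - r')**b from 0 to r: (1 - (1 - r)**(b + 1)) / (b + 1)."""
    return (1 - (1 - r_fraction) ** (b + 1)) / (b + 1)

for num_y, r in [(10, 4), (1000, 400)]:
    exact = exact_expectation(num_y, 2, r)
    approx = continuum_expectation(2, r / num_y)
    print(num_y, round(exact, 5), round(approx, 5))
```

The gap between the two values shrinks from roughly 0.03 at |Y| = 10 to below 0.001 at |Y| = 1000, in line with the figures quoted in the text.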



